StarFlow: Generating Structured Workflow Outputs From Sketch Images
Abstract
Workflows are a fundamental component of automation in enterprise platforms, enabling the orchestration of tasks, data processing, and system integrations. Despite being widely used, building workflows can be complex, often requiring manual configuration through low-code platforms or visual programming tools. To simplify this process, we explore the use of generative foundation models, particularly vision-language models (VLMs), to automatically generate structured workflows from visual inputs. Translating hand-drawn sketches or computer-generated diagrams into executable workflows is challenging due to the ambiguity of free-form drawings, variations in diagram styles, and the difficulty of inferring execution logic from visual elements. To address this, we introduce StarFlow, a framework for generating structured workflow outputs from sketches using vision-language models. We curate a diverse dataset of workflow diagrams—including synthetic, manually annotated, and real-world samples—to enable robust training and evaluation. We finetune and benchmark multiple vision-language models, conducting a series of ablation studies to analyze the strengths and limitations of our approach. Our results show that finetuning significantly enhances structured workflow generation, outperforming large vision-language models on this task.
1 Introduction
Workflows play a crucial role in automating business processes, orchestrating data flows, and integrating enterprise applications. They enable organizations to streamline operations, reduce manual effort, and enforce business logic across complex systems (ServiceNow, 2025; MuleSoft, a Salesforce Company, 2025; Microsoft, 2025). Despite their ubiquity, workflow creation remains a challenging task, often requiring users to manually configure processes through low-code platforms or visual programming environments. While these tools offer greater accessibility than traditional programming, they still demand a deep understanding of system logic, data dependencies, and execution rules.
An intuitive alternative would be the ability to generate structured workflows directly from visual representations, such as hand-drawn sketches or diagrammatic depictions. However, this problem is inherently difficult due to the ambiguity of free-form sketches, variations in diagramming conventions, and the complexity of extracting structured execution logic from visual elements.

In this work, we introduce StarFlow, a framework designed to generate structured workflow representations from sketch-based inputs using vision-language models (VLMs). Our approach involves curating a diverse dataset comprising synthetic, manually annotated, and real-world workflow diagrams, which we use to finetune multiple vision-language models. To evaluate the performance of our approach, we use a Flow Similarity metric that measures the structural fidelity of generated workflows based on the tree representation of the workflow and tree edit distance. Our experimental results demonstrate that finetuning significantly enhances the ability of VLMs to generate structured workflows, outperforming general-purpose models on this specialized task. Additionally, we find that this end-to-end approach of sketch to workflow generation outperforms a more complex pipeline that aims to decompose workflow generation into multiple steps.
The remainder of this paper is structured as follows: In Section 2, we review related work in code and structured output generation, visual language models, and prior research on workflow automation. In Section 3, we present our methodology, detailing how we construct synthetic workflows, generate corresponding diagram representations, and curate datasets for training and evaluation. Section 4 outlines our experimental setup, including our evaluation approach for workflows, baseline comparisons, model selection, and finetuning strategies. We analyze results in Section 5, providing insights into failure cases, model limitations, and areas where finetuning offers significant improvements. Finally, in Section 6, we conclude with a discussion of our key findings, potential limitations, and directions for future research.
By addressing the limitations of existing workflow creation methods and demonstrating the effectiveness of vision-language models in this domain, our work represents a step toward making workflow automation more intuitive and accessible. Our key contributions are as follows:
- We introduce StarFlow, a framework for converting hand-drawn and digitally rendered workflow sketches into structured representations, enabling seamless workflow automation.
- We build a diverse dataset of workflow diagrams, spanning synthetic, human-annotated, and real-world samples, to enhance training and evaluation.
- We analyze model performance across different image types, orientations, and resolutions, identifying challenges and best-performing configurations.
2 Related Work
2.1 Structured Output and Code Generation
Language models trained on code (e.g. Chen et al., 2021; Li et al., 2023; Roziere et al., 2023; Hui et al., 2024; Zhu et al., 2024) have seen significant advancements in recent years, improving various aspects of software development, including code generation (Nijkamp et al., 2022; Jiang et al., 2024; Rodriguez et al., 2024b), comprehension (Feng et al., 2020; Lu et al., 2021), and translation tasks (Lachaux et al., 2020; Yan et al., 2023). These models utilize large-scale source code datasets (e.g. Kocetkov et al., 2022) to learn programming syntax and semantics, enabling them to produce functional and syntactically correct code snippets from natural language prompts.
Evaluation of code generation models is notoriously difficult. Metrics such as the HumanEval benchmark (Chen et al., 2021) aim to evaluate a model’s ability to generate functionally correct code solutions. Other metrics, such as CodeBLEU (Ren et al., 2020), extend the traditional BLEU score (Papineni et al., 2002) by incorporating code-specific features such as syntax and data flow, offering a more nuanced evaluation of code generation quality. In this work, we draw inspiration from CodeBLEU and introduce a Flow Similarity metric based on tree representation and tree edit distance (Zhang and Shasha, 1989).
2.2 Multimodal Large Language Models
Vision-language models (VLMs) (e.g. Alayrac et al., 2022; Liu et al., 2023; Agrawal et al., 2024; Wang et al., 2024; Dubey et al., 2024) have made significant strides in integrating visual and textual data, enabling more sophisticated multimodal understanding. These models excel at various tasks including image captioning (Lin et al., 2014; Vinyals et al., 2015), visual question answering (Antol et al., 2015; Hudson and Manning, 2019; Yue et al., 2024), and document understanding (Rodriguez et al., 2024a; Tong et al., 2025).
One task that remains challenging for VLMs is generating code or structured outputs from a screenshot or diagram (Liu et al., 2022; Shukla et al., 2023; Shi et al., 2025; Rodriguez et al., 2024a; Herrera-Camara and Hammond, 2017). For example, Shi et al. (2025) introduce a benchmark to assess the performance of VLMs on generating code to reproduce charts. Closely related to our work, Liu et al. (2022) propose a two-step process to generate code from a flowchart using two distinct models: one extracts the structure of the diagram, and the other generates executable code from pseudocode. In this paper, we focus on generating structured workflows in JSON format from hand-drawn or computer-generated sketches.
2.3 Workflow Generation
Recent work on workflow generation from textual inputs has demonstrated significant advancements in the field of automated task planning and execution. Approaches relying on retrieval-augmented generation and task decomposition have been found effective for this problem (Béchard and Ayala, 2024; Bassamzadeh and Methani, 2024; Ayala and Béchard, 2024; Fan et al., 2024). Other notable works include Zeng et al. (2023), who built models to generate workflows on-the-fly for specific applications, Fan et al. (2024), who develop a synthetic data pipeline used to train a workflow generator, and Cai et al. (2023), who built a graphical user interface allowing a user to build and edit a workflow with the assistance of an LLM. In this work, we focus on generating workflows from hand-drawn sketches and computer-generated diagrams rather than from textual instructions.
3 Methodology
In this section, we go over the dataset creation process and how we evaluate generated flows. We first present a quick overview of what workflows are. We then discuss how we build synthetic workflows from patterns frequently found in real-world ones. Finally, we describe how we programmatically create diagrams for these workflows, and how we use these samples as a basis for the human-annotated data.
3.1 The Anatomy of a Workflow
Workflows are automated processes that consist of a sequence of reusable actions that perform operations on a user’s data. Within a workflow, actions are intertwined with flow logic elements, such as conditions and loops, that control the execution of the workflow. A workflow normally includes a trigger that determines when the execution starts. Alternatively, a subflow consists of the same actions and flow logic as a workflow, but does not include a trigger. Subflows are meant to be called by workflows or other subflows, similar to how functions are used in programming languages.
Workflows can be triggered in a variety of ways. For example, a workflow can start after a certain interval of time has passed, when a record has been updated in a given table, or when an email is received, to name a few. The actions found in workflows can also perform a variety of operations on behalf of a user. For example, they can look up a set of records in a given table, make updates to records, send emails, connect to third party APIs, and much more.
3.2 Synthetic Workflow Generation
Real world workflows are often built using a distinct set of design patterns. In order to build our synthetic workflow generation pipeline, we implemented a heuristic that can build workflows using a set of flow logic elements (e.g. IF, ELSE, FOREACH) along with actions and subflows sampled deterministically or randomly based on the pattern. Algorithm 1 presents a simplified look at the code used for creating a workflow following the Scheduled Loop pattern, which performs actions on multiple records at predefined time intervals.
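As a rough illustration of such a pattern-based heuristic, the following is a minimal sketch of a Scheduled Loop generator. It is not the pipeline used in this work: the JSON field names, the action names, and the sampled interval values are all hypothetical and chosen only to show how deterministic and randomly sampled parts can be combined.

```python
import json
import random

# Hypothetical action names, used only for illustration.
LOOKUP_ACTIONS = ["look_up_records"]
RECORD_ACTIONS = ["update_record", "send_email", "create_task"]


def build_scheduled_loop(rng: random.Random) -> dict:
    """Build one workflow following a Scheduled Loop-style pattern."""
    # Deterministic parts of the pattern: a time-based trigger, a lookup,
    # and a FOREACH over the records returned by the lookup.
    trigger = {
        "type": "scheduled",
        "inputs": {"interval_hours": rng.choice([1, 24, 168])},
    }
    lookup = {"order": 1, "type": "action", "name": rng.choice(LOOKUP_ACTIONS)}
    loop = {"order": 2, "type": "flowlogic", "name": "FOREACH", "items_from": 1}

    # Randomly sampled part: one to three actions applied to each record.
    body = [
        {
            "order": 3 + i,
            "type": "action",
            "name": rng.choice(RECORD_ACTIONS),
            "parent": 2,  # nested inside the FOREACH block
        }
        for i in range(rng.randint(1, 3))
    ]
    return {"trigger": trigger, "components": [lookup, loop, *body]}


workflow = build_scheduled_loop(random.Random(0))
print(json.dumps(workflow, indent=2))
```

Passing an explicit `random.Random` instance keeps generation reproducible, which is convenient when the same flow must later be rendered as several diagram variants.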
Table 1: Number of samples per source type in each dataset split.
Source | Train | Valid | Test |
---|---|---|---|
Synthetic | 12,376 | 1,000 | 1,000 |
Manual | 3,035 | 333 | 865 |
Digital | 2,613 | 241 | 701 |
Whiteboard | 484 | 40 | 46 |
User Interface | 373 | 116 | 87 |
Total | 18,881 | 1,730 | 2,699 |
After creating the workflows, we generate natural language annotations for each step using a large language model — in our case, we used Llama 3.1 70B Instruct (Dubey et al., 2024). We represent the resulting workflow in JSON format for the VLM to generate. Figure 6 in Appendix A presents an example flow generated using the Scheduled Loop heuristic.
Once the synthetic workflows are generated, we proceed to creating variants of these samples using a variety of methods, thus obtaining workflow diagrams of five different flavors: Synthetic, Manual, Digital, Whiteboard, and User Interface. We describe the generation process of each in the next section.
3.3 Creating Workflow Diagrams
Synthetic workflows are created by programmatically generating a graph representation of each workflow using Graphviz (Ellson et al., 2002), where we randomly modify the orientation of the graph and the way edges are represented. For example, the graph representation of the workflow in Figure 6 is shown in Figure 7(a) (Appendix B).
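The randomization of orientation and edge style can be sketched by emitting Graphviz DOT source directly. This is an illustrative sketch, not the rendering code used here: the function name and the restriction to a linear list of steps are assumptions, while `rankdir` and `splines` are standard Graphviz graph attributes.

```python
import random


def workflow_to_dot(steps: list[str], rng: random.Random) -> str:
    """Render a linear list of step labels as Graphviz DOT source."""
    rankdir = rng.choice(["TB", "LR"])  # top-to-bottom or left-to-right layout
    splines = rng.choice(["ortho", "polyline", "curved", "true"])  # edge style
    lines = [
        "digraph workflow {",
        f"  rankdir={rankdir};",
        f"  splines={splines};",
        "  node [shape=box];",
    ]
    for i, label in enumerate(steps):
        escaped = label.replace('"', '\\"')  # keep quotes inside labels valid
        lines.append(f'  n{i} [label="{escaped}"];')
    for i in range(len(steps) - 1):
        lines.append(f"  n{i} -> n{i + 1};")
    lines.append("}")
    return "\n".join(lines)


dot_source = workflow_to_dot(
    ["Trigger: daily", "Look up records", "FOREACH record", "Update record"],
    random.Random(7),
)
print(dot_source)
```

The resulting text can be written to a file and rendered with the `dot` command-line tool; branching logic such as IF and ELSE would need additional edges that this sketch omits.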
To create the User Interface workflows, we further render the programmatically generated flows using ServiceNow’s native visualization tool, as illustrated in Figure 7(e) (Appendix B). This offers an alternative representation of the flows within an environment that closely aligns with potential deployment scenarios.
The three workflow types Manual, Digital, and Whiteboard are created by human annotators. We commissioned an external vendor to hire human annotators to create flow diagrams based on the synthetically generated ones. The annotators were given these graph representations and were asked to create a flow diagram for each graph sample, either using digital tools (Digital), by drawing the graph on paper (Manual), or by drawing it on a whiteboard or blackboard (Whiteboard). Details regarding the human annotators can be found in Appendix C. Figures 7(b), 7(c), and 7(d) in Appendix B present examples of workflow diagrams for each of the sample types described above.
For each flow JSON in our dataset, we generate one or more images using the approach described above. We then divide the samples according to the flow JSON, ensuring that no flows are shared between the different dataset splits. The number of samples generated for each sample type can be found in Table 1.
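One way to guarantee that all images of a flow land in the same split is to derive the split from the flow JSON itself. The sketch below is an assumption about how this could be done, not the procedure used for this dataset: the hashing scheme, the 10%/10% proportions, and the sample records are hypothetical.

```python
import hashlib
import json


def flow_key(flow: dict) -> str:
    """Stable identifier for a flow, independent of key order."""
    canonical = json.dumps(flow, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def assign_split(flow: dict, valid_pct: int = 10, test_pct: int = 10) -> str:
    """Map a flow to a split using a hash bucket in the range 0..99."""
    bucket = int(flow_key(flow), 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + valid_pct:
        return "valid"
    return "train"


samples = [
    {"image": "flow_a_graphviz.png",
     "flow": {"trigger": "daily", "components": ["update_record"]}},
    {"image": "flow_a_whiteboard.jpg",
     "flow": {"components": ["update_record"], "trigger": "daily"}},
    {"image": "flow_b_graphviz.png",
     "flow": {"trigger": "record_updated", "components": ["send_email"]}},
]
splits = {s["image"]: assign_split(s["flow"]) for s in samples}
print(splits)
```

Because the split depends only on the flow content, two renderings of the same flow (for example a Graphviz image and a whiteboard photo) can never end up on opposite sides of the train and test boundary.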
4 Experiments
In this section, we conduct experiments to assess the capabilities of various open-weight and proprietary VLMs on the Sketch-to-Workflow task, and we describe the metrics used to evaluate them. Additionally, we examine whether finetuning improves performance on the downstream task.
4.1 Models
We perform our experiments using a variety of frontier models as well as open-weight alternatives. We evaluate the following proprietary models: GPT-4o and GPT-4o-mini (Hurst et al., 2024), Claude-3.7-Sonnet (Anthropic, 2024), and Gemini-2.0-Flash (Team et al., 2023). We compare these models against a set of strong open-weight alternatives, namely Pixtral (Agrawal et al., 2024), LLaMA 3.2 Vision (11B and 90B) (Dubey et al., 2024), Phi-3.5 (Abdin et al., 2024), Phi-4 (Abouelenin et al., 2025), and Qwen2.5-VL (3B, 7B and 72B) (Bai et al., 2025). Additionally, we finetune the smaller variants of the open-weight models and observe the resulting improvements on downstream tasks. Training details for the finetuned models can be found in Appendix D.
4.2 Evaluation of Generated Workflows
Assessing the quality of generated flows presents challenges similar to those in evaluating generated code. In this work, we report four types of metrics that provide a comprehensive evaluation by capturing different aspects of flow generation. The metrics we report are Flow Similarity (FlowSim), Tree BLEU (TreeBLEU), Trigger Match (TM), and Component Match (CM). For Flow Similarity, we follow the methodology used in Ayala and Béchard (2024): we decompose generated workflows into trees and compute the tree edit distance using the algorithm from Zhang and Shasha (1989). We normalize the obtained tree edit distance by the number of nodes in each tree to obtain a score between 0 and 1.
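$$\mathrm{FlowSim}(F, F_{\mathrm{ref}}) = 1 - \frac{\mathrm{TED}(F, F_{\mathrm{ref}})}{|F| + |F_{\mathrm{ref}}|} \tag{1}$$
where $F$ and $F_{\mathrm{ref}}$ denote the given flow and the reference flow, respectively, $\mathrm{TED}$ is the tree edit distance between their tree decompositions, and $|\cdot|$ is the number of nodes in a tree.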
We use a custom weighting scheme that assigns greater weight to changes affecting actions than those affecting inputs. Figure 8 in Appendix E illustrates the tree decomposition derived from the flow JSON defined in Figure 6.
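To make the metric concrete, here is a minimal sketch of a tree-edit-distance-based similarity. It makes several assumptions that differ from the setup described above: it uses a simple memoized recursion for ordered trees rather than the Zhang and Shasha algorithm, unit costs instead of the custom weighting scheme, and it normalizes by the total number of nodes in both trees. Trees are represented as `(label, children)` tuples, which is also an illustrative choice.

```python
from functools import lru_cache


def size(forest: tuple) -> int:
    """Total number of nodes in a forest (a tuple of trees)."""
    return sum(1 + size(children) for _, children in forest)


@lru_cache(maxsize=None)
def forest_distance(f1: tuple, f2: tuple) -> int:
    """Unit-cost edit distance between two ordered forests."""
    if not f1 and not f2:
        return 0
    if not f1:
        return size(f2)  # insert every remaining node
    if not f2:
        return size(f1)  # delete every remaining node
    (label1, kids1), (label2, kids2) = f1[-1], f2[-1]
    # Delete the rightmost root of f1; its children join the forest.
    delete = forest_distance(f1[:-1] + kids1, f2) + 1
    # Insert the rightmost root of f2.
    insert = forest_distance(f1, f2[:-1] + kids2) + 1
    # Match the two rightmost roots, relabeling if they differ.
    match = (
        forest_distance(kids1, kids2)
        + forest_distance(f1[:-1], f2[:-1])
        + (0 if label1 == label2 else 1)
    )
    return min(delete, insert, match)


def flow_similarity(tree: tuple, ref: tuple) -> float:
    """1 minus the edit distance normalized by the combined tree size."""
    ted = forest_distance((tree,), (ref,))
    return 1.0 - ted / (size((tree,)) + size((ref,)))


reference = ("flow", (("trigger:daily", ()), ("action:update_record", ())))
predicted = ("flow", (("trigger:daily", ()), ("action:send_email", ())))
print(flow_similarity(reference, reference))
print(flow_similarity(predicted, reference))
```

In this example the two trees differ by a single relabeled node out of six nodes in total, so the second score is one relabel short of a perfect match. The recursion is adequate for small trees; larger flows would call for the Zhang and Shasha algorithm.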
We also use a variant of TreeBLEU (Gui et al., 2025) that leverages our tree decomposition to assess structural hierarchy recall between flows.
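$$\mathrm{TreeBLEU}(F, F_{\mathrm{ref}}) = \frac{|S(F) \cap S(F_{\mathrm{ref}})|}{|S(F_{\mathrm{ref}})|} \tag{2}$$
where $F$ and $F_{\mathrm{ref}}$ denote the given flow and the reference flow, and $S(\cdot)$ denotes the set of 1-height subtrees of a flow's tree decomposition.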
To ensure fairness, we exclude subtrees of height 1 that are always present (specifically, the Flow Trigger and Flow Components edges) so that empty flows without triggers or components receive a score of zero.
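The recall over 1-height subtrees can be sketched as follows. This is an illustration under assumptions, not the evaluation code used here: a 1-height subtree is taken to be a node label together with the ordered labels of its direct children, trees are `(label, children)` tuples, and the node labels and the `ignore` set in the example are hypothetical.

```python
def one_height_subtrees(tree: tuple, ignore: frozenset = frozenset()) -> set:
    """Collect (parent label, child labels) pairs for every internal node."""
    label, children = tree
    found = set()
    if children:
        subtree = (label, tuple(child[0] for child in children))
        if subtree not in ignore:
            found.add(subtree)
        for child in children:
            found |= one_height_subtrees(child, ignore)
    return found


def tree_bleu(tree: tuple, ref: tuple, ignore: frozenset = frozenset()) -> float:
    """Fraction of the reference's 1-height subtrees found in the given tree."""
    ref_subtrees = one_height_subtrees(ref, ignore)
    if not ref_subtrees:
        return 0.0
    return len(one_height_subtrees(tree, ignore) & ref_subtrees) / len(ref_subtrees)


def make_flow(trigger_children: tuple, component_children: tuple) -> tuple:
    return ("flow", (("trigger", trigger_children), ("components", component_children)))


leaf = lambda name: (name, ())
reference = make_flow((leaf("daily"),), (leaf("look_up"), leaf("update")))
predicted = make_flow((leaf("daily"),), (leaf("look_up"), leaf("send_email")))
empty = make_flow((), ())

# The root subtree exists in every flow, so it is excluded from scoring.
always_present = frozenset({("flow", ("trigger", "components"))})

print(tree_bleu(predicted, reference, always_present))
print(tree_bleu(empty, reference, always_present))
```

Without the `ignore` set, the empty flow would still share the root subtree with the reference and receive a non-zero score, which is the situation the exclusion is meant to avoid.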
Trigger Match measures the percentage of cases where the model correctly predicts the trigger from the sample. Component Match, on the other hand, computes the intersection between the predicted and target components, normalized by their union. This metric evaluates the model’s ability to predict the correct components in an order-agnostic manner, akin to the bag-of-components metric from Béchard and Ayala (2024). Equation 3 depicts both the Trigger Match and Component Match metrics.
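$$\mathrm{TM}(F, F_{\mathrm{ref}}) = \mathbb{1}\left[t = t_{\mathrm{ref}}\right], \qquad \mathrm{CM}(F, F_{\mathrm{ref}}) = \frac{|C \cap C_{\mathrm{ref}}|}{|C \cup C_{\mathrm{ref}}|} \tag{3}$$
where $t$ and $t_{\mathrm{ref}}$ denote the trigger of the given and reference flows, and $C$ and $C_{\mathrm{ref}}$ denote the set of components in each flow.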
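Both metrics are simple to compute per sample, as the following sketch shows. Representing a component as a `(type, definition name, scope)` tuple and returning 1.0 when both component sets are empty are assumptions made for this illustration.

```python
def trigger_match(trigger: str, ref_trigger: str) -> int:
    """1 if the predicted trigger equals the reference trigger, else 0."""
    return int(trigger == ref_trigger)


def component_match(components, ref_components) -> float:
    """Order-agnostic overlap: intersection size over union size."""
    predicted, reference = set(components), set(ref_components)
    union = predicted | reference
    if not union:
        return 1.0  # assumption: two empty component sets count as a match
    return len(predicted & reference) / len(union)


reference = [
    ("action", "look_up_records", "global"),
    ("action", "create_a_user", "sn_ms_ad_spoke"),
]
predicted = [
    ("action", "look_up_records", "global"),
    ("action", "create_user", "ms_azure_active_directory"),
]

print(trigger_match("daily", "daily"))
print(component_match(predicted, reference))
```

Here one of the three distinct components is shared, so the component score is one third even though the mismatched component is semantically close to the reference. Averaging `trigger_match` over all samples gives the percentage of correctly predicted triggers.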
4.3 Sketch to Workflow
In this section, we assess the performance of the models listed in Section 4.1 on the task of sketch to workflow generation. We evaluate both proprietary and open-weights models across a variety of model sizes. Our experiments indicate that (1) most proprietary models perform better than open-weights ones without any domain-specific training, and that (2) finetuning on StarFlow helps open-weights models outperform proprietary models. Our results are summarized in Table 2.
Table 2: Sketch to workflow generation results for open-weights, proprietary, and finetuned models.
Model | FlowSim w/ inputs | FlowSim no inputs | TreeBLEU w/ inputs | TreeBLEU no inputs | Trigger match | Component match |
---|---|---|---|---|---|---|
Open-weights Models | | | | | | |
Qwen-2.5-VL-3B-Instruct (Wang et al., 2024) | 0.410 | 0.384 | 0.360 | 0.329 | 0.027 | 0.201 |
Phi-3.5-Vision-4B-Instruct (Abdin et al., 2024) | 0.364 | 0.346 | 0.337 | 0.295 | 0.079 | 0.193 |
Phi-4-Multimodal-6B-Instruct (Abouelenin et al., 2025) | 0.465 | 0.404 | 0.394 | 0.298 | 0.054 | 0.244 |
Qwen-2.5-VL-7B-Instruct (Wang et al., 2024) | 0.614 | 0.538 | 0.562 | 0.508 | 0.036 | 0.280 |
LLaMA-3.2-11B-Vision-Instruct (Dubey et al., 2024) | 0.466 | 0.435 | 0.416 | 0.382 | 0.075 | 0.239 |
Pixtral-12B (Agrawal et al., 2024) | 0.632 | 0.582 | 0.617 | 0.541 | 0.088 | 0.261 |
Qwen-2.5-VL-72B-Instruct (Wang et al., 2024) | 0.710 | 0.643 | 0.703 | 0.655 | 0.325 | 0.305 |
LLaMA-3.2-90B-Vision-Instruct (Dubey et al., 2024) | 0.687 | 0.603 | 0.681 | 0.627 | 0.328 | 0.286 |
Proprietary Models | | | | | | |
GPT-4o-Mini (Hurst et al., 2024) | 0.642 | 0.617 | 0.650 | 0.623 | 0.254 | 0.305 |
GPT-4o (Hurst et al., 2024) | 0.786 | 0.707 | 0.794 | 0.718 | 0.282 | 0.317 |
Claude-3.7-Sonnet (Anthropic, 2024) | 0.763 | 0.679 | 0.769 | 0.701 | 0.318 | 0.305 |
Gemini Flash 2.0 (Team et al., 2023) | 0.780 | 0.713 | 0.798 | 0.743 | 0.466 | 0.329 |
Finetuned Models | | | | | | |
Qwen-2.5-VL-3B-Instruct (Wang et al., 2024) | 0.941 | 0.911 | 0.941 | 0.902 | 0.775 | 0.909 |
Phi-3.5-Vision-4B-Instruct (Abdin et al., 2024) | 0.917 | 0.882 | 0.917 | 0.869 | 0.703 | 0.874 |
Phi-4-Multimodal-6B-Instruct (Abouelenin et al., 2025) | 0.939 | 0.908 | 0.940 | 0.902 | 0.770 | 0.907 |
Qwen-2.5-VL-7B-Instruct (Wang et al., 2024) | 0.957 | 0.927 | 0.956 | 0.920 | 0.819 | 0.934 |
LLaMA-3.2-11B-Vision-Instruct (Dubey et al., 2024) | 0.955 | 0.924 | 0.954 | 0.915 | 0.805 | 0.934 |
Pixtral-12B (Agrawal et al., 2024) | 0.952 | 0.919 | 0.950 | 0.908 | 0.753 | 0.930 |
Our results show that most proprietary models perform well on the workflow generation task. As expected, GPT-4o-mini underperforms compared to larger models, likely due to its smaller size. Among open-weight models, every model in the Qwen2.5-VL family performs remarkably well against models of similar size. Pixtral is another strong model for its size, nearly matching the performance of the larger Llama variant. In addition, performance trends remain consistent across different evaluation metrics. Across all models, scores for FlowSim and TreeBLEU are closely aligned, whether or not input conditions are considered. Finally, finetuned models perform strongly on the Trigger Match (TM) metric, whereas proprietary and non-finetuned open-weight models lag far behind.
Finetuning significantly improves performance, surpassing all baselines by a substantial margin. In particular, the finetuned version of Qwen-2.5-VL-7B achieves notably high scores compared to all other models, closely followed by Llama 3.2 11B and Pixtral-12B. We hypothesize that finetuned models acquire crucial domain knowledge during training, which proprietary models struggle to replicate without additional external information. For example, when prompted with an image representing a flow for creating a user in Microsoft Azure Active Directory, a proprietary model must infer the type, definition, and scope of the relevant component. If the model predicts a component of type action with definition name create_user in scope ms_azure_active_directory, but the actual answer is a component of type action with definition name create_a_user in scope sn_ms_ad_spoke, it receives a score of zero.
Finetuned models benefit from exposure to such components during training, allowing them to memorize proper naming conventions of different components and improve accuracy. There are several potential ways to mitigate this issue. One approach is to integrate tool calls, enabling the VLM to retrieve relevant components during generation. Another is to incorporate retrieval-augmented generation (RAG) by extracting relevant details directly from images. Alternatively, breaking down the task into smaller subtasks could facilitate more effective retrieval of contextual information, helping to ground VLMs during generation.
4.4 Evaluation by Subpopulation
In this section, we investigate whether some models have more difficulty generating flows for certain types of images. We perform our analysis along three distinct axes: source of sample, orientation of sample, and image size. We perform these experiments on a subset of the models described in Section 4.1.
4.4.1 Source of Sample
In the previous section, we observed that models finetuned with StarFlow generally outperform ones that are not finetuned in workflow generation. One question is whether these models perform better across all types of samples. To answer this question, we evaluate a subset of the models evaluated in Section 4.3 on a stratified version of our dataset. Results are shown in Figure 2.

We find that all models experience a drop in performance on the Manual samples compared to other types of samples, closely followed by Whiteboard samples. Intuitively, these images are the hardest ones to interpret, as the model has to read handwritten text in order to understand which component to select when generating the workflow. On the other hand, we find that User Interface screenshots and Synthetic samples are the easiest samples. Since these samples are rendered automatically, we hypothesize that they contain the least ambiguity regarding the execution logic of the workflow. User Interface samples contain more textual information than other types, as the interface interprets the flow and presents additional details about the available triggers and components (see Appendix B). This extra context can make the task easier for models.
4.4.2 Orientation of Sample


Workflow diagrams can be represented horizontally or vertically without changing their meaning. As such, we are interested in assessing whether models are better at interpreting sketches presented from top-to-bottom (portrait) versus left-to-right (landscape). Our criterion for differentiating the two types of samples is based on the aspect ratio of the image. We define samples with images that are at least twice as wide as they are tall as landscape samples, and the rest as portrait.
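The aspect-ratio rule can be written as a one-line check. The sketch below assumes that an image exactly twice as wide as it is tall counts as landscape; the function name is hypothetical.

```python
def orientation(width: int, height: int) -> str:
    """Classify an image as landscape or portrait from its pixel dimensions."""
    # Assumption: "twice as wide" is read as "at least twice as wide".
    return "landscape" if width >= 2 * height else "portrait"


for width, height in [(2400, 1000), (1500, 1000), (800, 1600)]:
    print(width, height, orientation(width, height))
```

Note that under this rule a moderately wide image, such as 1500 by 1000 pixels, is still counted as portrait.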
Our results, summarized in Table 4, show that all benchmarked models exhibit a slight drop in performance and that this gap is more pronounced for the non-finetuned variant of Pixtral-12B (even as this gap is reduced after finetuning). We hypothesize that part of the difference in performance might be explained by the composition of each split. For example, User Interface samples, which are easier examples (see Section 4.4.1), are largely portrait samples due to the nature of the data collection. The presence of such samples in the Portrait category might skew the results.
4.4.3 Ablation on Image Resolution
We study the effect of image resolution on model performance. We split sample images into three categories based on size: small (less than 400k pixels of area), large (more than 1M pixels of area), and medium (in between). We choose these boundaries so that the categories contain approximately the same number of samples. We present results in Figure 4.
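The size buckets follow directly from the pixel area of each image. In this sketch, the function name is hypothetical, and images whose area falls exactly on a boundary are assigned to the medium category.

```python
SMALL_MAX_AREA = 400_000    # fewer pixels than this: small
LARGE_MIN_AREA = 1_000_000  # more pixels than this: large


def size_category(width: int, height: int) -> str:
    """Bucket an image by its area in pixels."""
    area = width * height
    if area < SMALL_MAX_AREA:
        return "small"
    if area > LARGE_MIN_AREA:
        return "large"
    return "medium"


for width, height in [(500, 500), (1000, 800), (1600, 1200)]:
    print(width, height, width * height, size_category(width, height))
```

For example, a 500 by 500 image has an area of 250,000 pixels and is small, while a 1600 by 1200 image has 1,920,000 pixels and is large.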
We observe that GPT-4o and Gemini-2.0-Flash perform better on samples of medium size than on smaller or larger samples by a non-negligible margin. This trend does not hold for the open-weight models, which both perform better on smaller images. It is worth noting that the performance of Pixtral does not degrade further on larger images, whereas Llama performs worse as image size increases. Finally, we observe that this trend of degraded performance on larger samples persists after finetuning, although to a lesser extent.
Table 3: Results on workflow screenshots from a different platform (mean ± standard deviation over 3 runs).
Model | FlowSim no inputs | FlowSim w/ inputs |
---|---|---|
Open-weights | | |
Pixtral 12B | 0.193 ± 0.004 | 0.266 ± 0.003 |
Llama 3.2 11B | 0.057 ± 0.013 | 0.155 ± 0.004 |
Proprietary | | |
GPT-4o | 0.287 ± 0.008 | 0.259 ± 0.010 |
Gemini 2.0 Flash | 0.294 ± 0.009 | 0.279 ± 0.004 |
Finetuned | | |
Pixtral 12B | 0.213 ± 0.004 | 0.180 ± 0.007 |
Llama 3.2 11B | 0.2397 ± 0.0002 | 0.203 ± 0.004 |
Table 4: End-to-end sketch to workflow generation compared with the decomposed pipeline.
Model | FlowSim no input | FlowSim w/ input |
---|---|---|
Sketch → Workflow | | |
GPT-4o | 0.786 | 0.707 |
Pixtral-12B | 0.632 | 0.582 |
Pixtral-12B (ft) | 0.952 | 0.919 |
Sketch → Summary → Outline → Workflow | | |
GPT-4o | 0.727 | 0.647 |
Mistral-Nemo-Instruct-2407 | 0.472 | 0.414 |
Mistral-Nemo-Instruct-2407 (ft) | 0.834 | 0.828 |
4.5 Workflows from Different Platforms
We are interested in the capability of our models to generalize to different types of workflow images. In this case, we have workflow screenshots from a different user interface that we want to convert to workflows in JSON. This task can be highly valuable for migrating workflows from one platform or application to another with minimal intervention. Figure 9 shows an example of such workflow with complex execution logic.
To evaluate this task, we manually collect 60 samples from an existing legacy workflow platform and convert them to the same JSON format as the other workflows. Due to the small sample size of our dataset, we evaluate using a small temperature of 0.3 and report average and standard deviation over 3 runs. Results are summarized in Table 3.
We observe a significant drop in performance from all models on this task, including the larger models such as GPT-4o. We observe that the finetuned models seem to generalize poorly to the new images, barely outperforming their base variant on this task. We see that the performance of the finetuned Pixtral model degrades on the Flow Similarity with inputs metric, while seeing a slight bump in performance on the other metric. The base Llama model performs very poorly on the task, while its finetuned variant performs a bit better, but still behind GPT-4o and Gemini.
We note however that this task is more complex than the other ones found in this paper since there might be no direct mapping between nodes seen in the workflow image to the JSON representation as each workflow application has its own particularities, allowing different logical patterns, triggers, components, etc. Moreover, more than one valid resulting workflow might be possible for a given image, rendering our evaluation strategy limited in such cases. Improving our evaluation metric to take execution outcome into account (akin to HumanEval (Chen et al., 2021)) would help mitigate this problem.
4.6 End-to-End vs Task Decomposition
Here, we examine whether our end-to-end baseline approach of sketch to workflow generation can match the performance of a pipeline that decomposes the task into multiple subtasks. Following a methodology that closely matches Ayala and Béchard (2024), we first introduce the task of sketch to workflow summary, which aims to condense the different actions performed in a given workflow sketch into a natural language summary. We then generate a workflow outline from this summary, and finally generate inputs for the trigger and each action found in the generated outline iteratively. This modular design allows sketch to workflow generation to incorporate search calls that retrieve relevant actions and inputs to include in the final flow. For each of our experiments, we use GPT-4o as the image summarizer and a different model for workflow generation (proprietary, open-weights, or finetuned). Results are presented in Table 4.
We find that decomposing the task into multiple subtasks yields lower results across the tested models. This is most likely due to errors compounding at every step of the generation pipeline: every small detail missed by the summarization step will impact the generation of the flow outline, which will itself impact which inputs get populated for each component. Moreover, keeping image to flow generation as a single task can drastically decrease the total latency of the application, as the total number of calls to the LLM or VLM is significantly reduced.
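The structure of the decomposed pipeline, and why it needs more model calls than the end-to-end approach, can be sketched as follows. This is a schematic illustration only: the function names, the dictionary layout of the outline, and the stub models are hypothetical, and the search calls mentioned above are omitted.

```python
from typing import Callable


def decomposed_pipeline(
    image: bytes,
    summarize: Callable[[bytes], str],
    generate_outline: Callable[[str], dict],
    generate_inputs: Callable[[str, dict], dict],
) -> dict:
    """Sketch -> summary -> outline -> inputs, one model call per step."""
    summary = summarize(image)           # VLM call: describe the sketch
    outline = generate_outline(summary)  # LLM call: trigger and components, no inputs
    trigger = {
        **outline["trigger"],
        "inputs": generate_inputs(summary, outline["trigger"]),
    }
    components = [
        {**step, "inputs": generate_inputs(summary, step)}
        for step in outline["components"]
    ]
    return {"trigger": trigger, "components": components}


def pipeline_calls(num_components: int) -> int:
    """Model calls: summary + outline + trigger inputs + one per component."""
    return 3 + num_components


# Stub models standing in for real VLM and LLM calls.
calls = []


def fake_summarize(image: bytes) -> str:
    calls.append("summarize")
    return "Every day, look up incidents and update each one."


def fake_outline(summary: str) -> dict:
    calls.append("outline")
    return {
        "trigger": {"name": "daily"},
        "components": [{"name": "look_up_records"}, {"name": "update_record"}],
    }


def fake_inputs(summary: str, step: dict) -> dict:
    calls.append("inputs")
    return {}


flow = decomposed_pipeline(b"", fake_summarize, fake_outline, fake_inputs)
print(len(calls), pipeline_calls(len(flow["components"])))
```

Each later stage only sees the text produced by the previous one, so a detail dropped from the summary cannot be recovered downstream. The call count also grows with the number of components, whereas the end-to-end approach issues a single call per sketch.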
5 Error Analysis and Discussion
In this section, we examine the current failure modes of various models in workflow generation. To illustrate this, we present a representative example that highlights the strengths and limitations of each approach. We will use the flow depicted in Figure 5(a) to compare the capabilities of each model qualitatively. For the sake of brevity, we will focus on Llama 3.2 11B, a finetuned variant of that same model, and GPT-4o.
When prompted to generate a flow, a non-finetuned Llama 3.2 11B model makes basic errors, such as predicting the wrong trigger for the task and picking an unrelated table. Moreover, the model fails to use flow logic elements properly and hallucinates actions completely unrelated to the sketch, such as adding a component to send an email. The resulting flow can be seen in Figure 5(b).
A strong proprietary model like GPT-4o performs qualitatively better on the task. In Figure 5(c), we observe that the model is able to properly predict the trigger and most of the components without generating unrelated ones. However, the model sometimes struggles to keep track of the flow execution logic: here, it omits an ELSE statement in the flow. The model also encounters difficulties with some fine-grained details that pertain to the component inputs, such as the table to invoke. In this example, the model falls back to using a generic activity table when it is unsure what the name of the table should be given the provided information. We hypothesize that letting the model use tools to retrieve some of this domain-specific information when populating the flow might help remedy this problem.




Finally, the finetuned variant of Llama 3.2 11B performs better than its counterparts on this example. The model predicts the appropriate flow execution logic along with all relevant components. It is also able to properly predict the right tables for the task as it has seen some data from the same domain during the finetuning phase. However, as noted in Section 4.5, this model still fails to generalize to out-of-distribution samples, which might apply to different image styles as well as different logic patterns found in more complex workflows.
6 Conclusion
In this paper, we presented StarFlow, a framework for structured workflow generation from sketch-based diagrams. By leveraging vision-language models and a diverse dataset, we demonstrated that finetuned models outperform general-purpose models in accurately translating sketches into structured workflow representations. Our experiments revealed key insights into the challenges posed by different sketch sources, orientations, and image resolutions, underscoring the importance of domain-specific training.
While our approach shows strong performance in workflow generation, future work could explore extending the methodology to broader workflow visualization styles and improving robustness to handwritten annotations. Additionally, refining evaluation metrics to consider functional execution correctness could provide a more comprehensive assessment of generated workflows. Finally, augmenting models with external information via retrieval-augmented generation or function calling might help better ground the models in generating accurate information in the workflows. Overall, StarFlow represents a step toward making workflow automation more accessible and intuitive by enabling seamless sketch-to-workflow generation.
References
- Abdin et al. [2024] Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone, 2024. URL https://arxiv.org/abs/2404.14219.
- Abouelenin et al. [2025] Abdelrahman Abouelenin, Atabak Ashfaq, Adam Atkinson, Hany Awadalla, Nguyen Bach, Jianmin Bao, Alon Benhaim, Martin Cai, Vishrav Chaudhary, Congcong Chen, et al. Phi-4-mini technical report: Compact yet powerful multimodal language models via mixture-of-loras. arXiv preprint arXiv:2503.01743, 2025.
- Agrawal et al. [2024] Pravesh Agrawal, Szymon Antoniak, Emma Bou Hanna, Baptiste Bout, Devendra Chaplot, Jessica Chudnovsky, Diogo Costa, Baudouin De Monicault, Saurabh Garg, Theophile Gervet, et al. Pixtral 12b. arXiv preprint arXiv:2410.07073, 2024.
- Alayrac et al. [2022] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35:23716–23736, 2022.
- Anthropic [2024] Anthropic. Claude 3.7 Sonnet System Card, 2024. URL https://assets.anthropic.com/m/785e231869ea8b3b/original/claude-3-7-sonnet-system-card.pdf. Accessed: 2025-03-04.
- Antol et al. [2015] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision, pages 2425–2433, 2015.
- Ayala and Béchard [2024] Orlando Marquez Ayala and Patrice Béchard. Generating a low-code complete workflow via task decomposition and rag. arXiv preprint arXiv:2412.00239, 2024.
- Bai et al. [2025] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025.
- Bassamzadeh and Methani [2024] Nastaran Bassamzadeh and Chhaya Methani. A comparative study of dsl code generation: Fine-tuning vs. optimized retrieval augmentation. arXiv preprint arXiv:2407.02742, 2024.
- Béchard and Ayala [2024] Patrice Béchard and Orlando Marquez Ayala. Reducing hallucination in structured outputs via retrieval-augmented generation. arXiv preprint arXiv:2404.08189, 2024.
- Cai et al. [2023] Yuzhe Cai, Shaoguang Mao, Wenshan Wu, Zehua Wang, Yaobo Liang, Tao Ge, Chenfei Wu, Wang You, Ting Song, Yan Xia, et al. Low-code llm: Graphical user interface over large language models. arXiv preprint arXiv:2304.08103, 2023.
- Chen et al. [2021] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
- Dubey et al. [2024] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- Ellson et al. [2002] John Ellson, Emden Gansner, Lefteris Koutsofios, Stephen C North, and Gordon Woodhull. Graphviz—open source graph drawing tools. In Graph Drawing: 9th International Symposium, GD 2001 Vienna, Austria, September 23–26, 2001 Revised Papers 9, pages 483–484. Springer, 2002.
- Fan et al. [2024] Shengda Fan, Xin Cong, Yuepeng Fu, Zhong Zhang, Shuyan Zhang, Yuanwei Liu, Yesai Wu, Yankai Lin, Zhiyuan Liu, and Maosong Sun. Workflowllm: Enhancing workflow orchestration capability of large language models. arXiv preprint arXiv:2411.05451, 2024.
- Feng et al. [2020] Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, et al. Codebert: A pre-trained model for programming and natural languages. arXiv preprint arXiv:2002.08155, 2020.
- Gui et al. [2025] Yi Gui, Zhen Li, Yao Wan, Yemin Shi, Hongyu Zhang, Yi Su, Bohua Chen, Dongping Chen, Siyuan Wu, Xing Zhou, et al. Webcode2m: A real-world dataset for code generation from webpage designs. In THE WEB CONFERENCE 2025, 2025.
- Herrera-Camara and Hammond [2017] Jorge-Ivan Herrera-Camara and Tracy Hammond. Flow2code: from hand-drawn flowcharts to code execution. In Proceedings of the Symposium on Sketch-Based Interfaces and Modeling, pages 1–13, 2017.
- Hudson and Manning [2019] Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700–6709, 2019.
- Hui et al. [2024] Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et al. Qwen2.5-coder technical report. arXiv preprint arXiv:2409.12186, 2024.
- Hurst et al. [2024] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024.
- Jiang et al. [2024] Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. A survey on large language models for code generation. arXiv preprint arXiv:2406.00515, 2024.
- Kocetkov et al. [2022] Denis Kocetkov, Raymond Li, Loubna Ben Allal, Jia Li, Chenghao Mou, Carlos Muñoz Ferrandis, Yacine Jernite, Margaret Mitchell, Sean Hughes, Thomas Wolf, et al. The stack: 3 tb of permissively licensed source code. arXiv preprint arXiv:2211.15533, 2022.
- Lachaux et al. [2020] Marie-Anne Lachaux, Baptiste Roziere, Lowik Chanussot, and Guillaume Lample. Unsupervised translation of programming languages. arXiv preprint arXiv:2006.03511, 2020.
- Li et al. [2023] Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. Starcoder: may the source be with you! arXiv preprint arXiv:2305.06161, 2023.
- Lin et al. [2014] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer vision–ECCV 2014: 13th European conference, zurich, Switzerland, September 6-12, 2014, proceedings, part v 13, pages 740–755. Springer, 2014.
- Liu et al. [2023] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36:34892–34916, 2023.
- Liu et al. [2022] Zejie Liu, Xiaoyu Hu, Deyu Zhou, Lin Li, Xu Zhang, and Yanzheng Xiang. Code generation from flowcharts with texts: A benchmark dataset and an approach. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Findings of the Association for Computational Linguistics: EMNLP 2022, pages 6069–6077, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.findings-emnlp.449. URL https://aclanthology.org/2022.findings-emnlp.449/.
- Loshchilov and Hutter [2017] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
- Lu et al. [2021] Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn Drain, Daxin Jiang, Duyu Tang, et al. Codexglue: A machine learning benchmark dataset for code understanding and generation. arXiv preprint arXiv:2102.04664, 2021.
- Microsoft [2025] Microsoft. Power Automate - Microsoft Power Platform, 2025. URL https://www.microsoft.com/en-us/power-platform/products/power-automate. Accessed: 2025-03-05.
- MuleSoft, a Salesforce Company [2025] MuleSoft, a Salesforce Company. Automation with MuleSoft, 2025. URL https://www.salesforce.com/mulesoft/automation/. Accessed: 2025-03-05.
- Nijkamp et al. [2022] Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. Codegen: An open large language model for code with multi-turn program synthesis. arXiv preprint arXiv:2203.13474, 2022.
- Papineni et al. [2002] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002.
- Rajbhandari et al. [2020] Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE, 2020.
- Ren et al. [2020] Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio Blanco, and Shuai Ma. Codebleu: a method for automatic evaluation of code synthesis. arXiv preprint arXiv:2009.10297, 2020.
- Rodriguez et al. [2024a] Juan Rodriguez, Xiangru Jian, Siba Smarak Panigrahi, Tianyu Zhang, Aarash Feizi, Abhay Puri, Akshay Kalkunte, François Savard, Ahmed Masry, Shravan Nayak, et al. Bigdocs: An open and permissively-licensed dataset for training multimodal models on document and code tasks. arXiv preprint arXiv:2412.04626, 2024a.
- Rodriguez et al. [2024b] Juan A. Rodriguez, Abhay Puri, Shubham Agarwal, Issam H. Laradji, Pau Rodriguez, Sai Rajeswar, David Vazquez, Christopher Pal, and Marco Pedersoli. Starvector: Generating scalable vector graphics code from images and text, 2024b. URL https://arxiv.org/abs/2312.11556.
- Roziere et al. [2023] Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, et al. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023.
- ServiceNow [2025] ServiceNow. Flow Designer - ServiceNow, 2025. URL https://www.servicenow.com/products/platform-flow-designer.html. Accessed: 2025-03-05.
- Shi et al. [2025] Chufan Shi, Cheng Yang, Yaxin Liu, Bo Shui, Junjie Wang, Mohan Jing, Linran XU, Xinyu Zhu, Siheng Li, Yuxiang Zhang, Gongye Liu, Xiaomei Nie, Deng Cai, and Yujiu Yang. Chartmimic: Evaluating LMM’s cross-modal reasoning capability via chart-to-code generation. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=sGpCzsfd1K.
- Shukla et al. [2023] Shreya Shukla, Prajwal Gatti, Yogesh Kumar, Vikash Yadav, and Anand Mishra. Towards making flowchart images machine interpretable. In International Conference on Document Analysis and Recognition, pages 505–521. Springer, 2023.
- Team et al. [2023] Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
- Tong et al. [2025] Peter Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Adithya Jairam Vedagiri IYER, Sai Charitha Akula, Shusheng Yang, Jihan Yang, Manoj Middepogu, Ziteng Wang, et al. Cambrian-1: A fully open, vision-centric exploration of multimodal llms. Advances in Neural Information Processing Systems, 37:87310–87356, 2025.
- Vinyals et al. [2015] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3156–3164, 2015.
- Wang et al. [2024] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.
- Yan et al. [2023] Weixiang Yan, Yuchen Tian, Yunzhe Li, Qian Chen, and Wen Wang. Codetransocean: A comprehensive multilingual benchmark for code translation. arXiv preprint arXiv:2310.04951, 2023.
- Yue et al. [2024] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556–9567, 2024.
- Zeng et al. [2023] Zhen Zeng, William Watson, Nicole Cho, Saba Rahimi, Shayleen Reynolds, Tucker Balch, and Manuela Veloso. Flowmind: automatic workflow generation with llms. In Proceedings of the Fourth ACM International Conference on AI in Finance, pages 73–81, 2023.
- Zhang and Shasha [1989] Kaizhong Zhang and Dennis Shasha. Simple fast algorithms for the editing distance between trees and related problems. SIAM journal on computing, 18(6):1245–1262, 1989.
- Zhu et al. [2024] Qihao Zhu, Daya Guo, Zhihong Shao, Dejian Yang, Peiyi Wang, Runxin Xu, Y Wu, Yukun Li, Huazuo Gao, Shirong Ma, et al. Deepseek-coder-v2: Breaking the barrier of closed-source models in code intelligence. arXiv preprint arXiv:2406.11931, 2024.
Appendix A Example of flow JSON generated by workflow generation heuristic
Appendix B Types of samples





Appendix C Human Annotators
We partnered with a for-profit data labeling company (referred to as the "Vendor") specializing in data curation for AI applications. The annotation process spanned a three-month period, beginning with a pilot phase in the first month. During this phase, we collaborated closely with the Vendor’s annotation team, conducting detailed reviews and providing extensive feedback to ensure annotators fully understood the task requirements.
Our dataset was annotated by a dedicated team of 24 professionals based in India. These annotators possessed strong proficiency in technical writing and English, with educational backgrounds primarily in Engineering, Computer Science, and related disciplines. The majority held bachelor’s degrees, while some had advanced degrees in specialized fields. Additionally, they brought prior experience in data labeling, ensuring familiarity with structured annotation tasks.
To ensure the highest standards of annotation quality, a comprehensive quality assurance framework was implemented, requiring each annotation to undergo at least three independent review stages. The process began with an initial annotation conducted by experienced annotators or trainers, followed by a primary quality assurance review, where a specialist assessed accuracy, completeness, and adherence to annotation guidelines. Finally, a secondary review ensured consistency and alignment with evolving project requirements. This structured, multi-tiered approach reinforced annotation quality, minimized inconsistencies, and enhanced dataset reliability.
To uphold ethical labor standards and maintain high annotation quality, all annotators were compensated at rates exceeding fair market wages in their respective countries. This strategy supports the recruitment and retention of highly skilled professionals, fostering long-term engagement and ensuring annotation consistency across the project.
Appendix D Training Details
We use a consistent training setup for all finetuned models presented in this paper. To mitigate overfitting, we applied early stopping based on evaluation loss. The learning rate was initialized at , and we used the AdamW optimizer (Loshchilov and Hutter, 2017) with values of , weight decay of , and an epsilon value of to ensure numerical stability. The learning rate followed a cosine schedule with a warmup phase of 30 steps. Additionally, we enforced a maximum gradient norm of 1.0 to prevent gradient explosion.
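As a rough illustration of the schedule and clipping described above, the following Python sketch computes the fraction of the peak learning rate at each step and rescales a gradient vector to a maximum norm of 1.0. Only the 30-step warmup and the 1.0 norm limit come from the text. The linear warmup shape, the decay to zero, the `total_steps` argument, and the function names `lr_multiplier` and `clip_by_global_norm` are assumptions made for illustration, not the setup actually used for training.

```python
import math


def lr_multiplier(step: int, total_steps: int, warmup_steps: int = 30) -> float:
    """Fraction of the peak learning rate to apply at a given optimizer step.

    warmup_steps=30 follows the text; total_steps and the linear warmup
    shape are assumptions for this sketch.
    """
    if step < warmup_steps:
        # Assumed linear warmup that reaches the peak rate on the last warmup step.
        return (step + 1) / warmup_steps
    # Cosine decay from 1.0 right after warmup down to 0.0 at total_steps (assumed floor).
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


def clip_by_global_norm(grads: list[float], max_norm: float = 1.0) -> list[float]:
    """Rescale a flat gradient vector so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]
```

Multiplying the returned fraction by the peak learning rate gives the rate for a step. The peak rate itself is left as a caller-supplied value here.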
For all finetuning runs, we trained both the language model and the connector components of the VLM while keeping the vision encoder frozen. Each model was trained to support sequences of up to 32k tokens, including both image and text inputs.
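The split between trained and frozen components can be sketched as a selection over parameter names. The prefixes `vision_encoder.`, `connector.`, and `language_model.` and the helper `select_trainable` are hypothetical, since actual parameter names depend on the model implementation.

```python
def select_trainable(param_names, frozen_prefixes=("vision_encoder.",)):
    """Map each parameter name to True (trained) or False (frozen).

    The prefix naming scheme is a hypothetical convention for this sketch.
    """
    frozen = tuple(frozen_prefixes)
    return {name: not name.startswith(frozen) for name in param_names}


# Hypothetical parameter names for a VLM with three components.
example_names = [
    "vision_encoder.blocks.0.weight",
    "connector.proj.weight",
    "language_model.layers.0.attn.weight",
]
trainable_flags = select_trainable(example_names)
```

In a PyTorch-style training loop, one plausible use of these flags is to set `requires_grad` on each parameter accordingly before building the optimizer.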
We conducted training using 16 NVIDIA H100 80GB GPUs across two nodes. Fully Sharded Data Parallel (FSDP; Rajbhandari et al., 2020) was employed without CPU offloading. We also used mixed-precision training with bfloat16 (bf16).
Appendix E Decomposition of a workflow in its Tree representation

Appendix F Example of workflow representation not seen during training
