Script task performance

Script tasks in Collibra workflows provide a powerful and flexible mechanism for automating actions. However, as a workflow developer, you must carefully consider the performance implications of your code before deploying it to production.

A workflow that performs well in a testing environment may exhibit different behavior in production due to variations in data volume, complexity, and concurrent workflow executions.

If you experience slowness after deploying a new or updated workflow, analyze whether the workflow changes could have caused the issue. Revert the changes if necessary.

Factors influencing performance

The performance of a script task depends primarily on the complexity of the Groovy script. The following factors contribute to script complexity:

Lines of code: More code generally takes longer to compile and execute.
Data volume and API calls: Processing large datasets or making numerous API calls can significantly impact performance.
Looping constructs:
- for loops iterating over a large number of items.
- while loops whose exit condition may never be met, or recursive methods that never reach a terminating case.
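One common way to guard against a loop whose exit condition is never met is to cap the number of iterations. The following is a minimal sketch in plain Groovy; the helper name runBounded and the cap of 1000 are hypothetical choices for illustration, not part of the Collibra API.

```groovy
// Hypothetical helper: calls `step` until it returns false
// or until it has returned true `maxIterations` times, whichever comes first.
def runBounded(int maxIterations, Closure step) {
    int iterations = 0
    while (iterations < maxIterations && step.call(iterations)) {
        iterations++
    }
    return iterations
}

// A condition that never becomes false would normally loop forever;
// here the cap stops the loop after 1000 iterations.
def completed = runBounded(1000) { i -> true }
println "Stopped after ${completed} iterations"
```

In a real script task, the closure would hold the work done per iteration, and reaching the cap would usually be logged or treated as an error.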

Compilation time also affects performance. Before execution, each script task must be compiled into Java bytecode. This step adds overhead, and longer scripts take more time to compile.

To improve performance, the Groovy engine caches compiled scripts for reuse. Cached scripts execute faster than non-cached scripts. However, scripts take longer to execute under the following conditions:

The first time a workflow runs after deployment.
The first time a workflow runs after an application restart.
When the Java garbage collector clears the Groovy engine cache, typically due to low memory.
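The effect of a compiled-script cache can be illustrated with a small standalone Groovy sketch. This is not how Collibra implements its cache; the names compiled, compileCount, and runCached are hypothetical, and the example only shows why the first run of a script pays a compilation cost that later runs avoid.

```groovy
// Illustrative only: a tiny cache of compiled scripts keyed by source text.
def shell = new GroovyShell()
def compiled = [:]      // source text -> compiled Script object
def compileCount = 0    // how many times a compilation actually happened

def runCached = { String source ->
    if (!compiled.containsKey(source)) {
        // Cache miss: compile the source (the slow step) and remember the result.
        compiled[source] = shell.parse(source)
        compileCount++
    }
    // Cache hit or freshly compiled: run the stored script.
    compiled[source].run()
}

def first = runCached('1 + 2')    // compiles, then runs
def second = runCached('1 + 2')   // reuses the compiled script
println "results: ${first}, ${second}; compilations: ${compileCount}"
```

If the cache entry were discarded, for example under memory pressure, the next call would compile the script again, which corresponds to the slower runs listed above.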

The impact of loops on memory usage

By default, script tasks execute synchronously, committing all changes in a single transaction. If an error occurs, all changes are rolled back to the last committed state, which can be any of the following:

The beginning of the workflow.
The last completed user task.
The last completed asynchronous task.

When processing large amounts of data in for loops, all the data is retained in memory until a commit occurs. This is true even if you are iterating through paged API results.

To prevent memory issues in such scenarios, use asynchronous mode. For more information, see Process execution.

However, asynchronous mode introduces additional complexity, particularly for advanced exception handling. If an error occurs, already committed changes are not undone unless you implement code to reverse those changes.
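One possible shape for such reversal code is to record a compensating action for each change and run those actions in reverse order when a later step fails. The sketch below uses an in-memory list as a stand-in for committed changes; committed, undoActions, and applyStep are hypothetical names, and the failure is simulated.

```groovy
// Hypothetical in-memory stand-in for changes that were already committed.
def committed = []
def undoActions = []   // compensating actions, in the order the steps ran

def applyStep = { String name ->
    committed << name
    // Remember how to reverse this particular step.
    undoActions << { committed.remove(name) }
}

try {
    applyStep('create asset')
    applyStep('set attribute')
    throw new IllegalStateException('simulated failure in a later step')
} catch (IllegalStateException e) {
    // Undo the most recent change first.
    undoActions.reverse().each { it.call() }
}

println "Changes remaining after compensation: ${committed}"
```

In an actual workflow, each compensating action would call whatever operation reverses the committed change, and it would itself need error handling.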

Memory issues can significantly degrade performance, primarily because the system may swap data to disk, which is slower. Additionally, memory problems can impact the compilation time of subsequent script tasks. When the Java garbage collector frees up space by deleting previously compiled scripts, these scripts must be recompiled, further slowing execution.

The impact of using the groovy-lib mechanism on compilation time

If you use the groovy-lib mechanism to create reusable functions across workflows, note that, by default, the entire content of the groovy-lib folder is included in each script task before compilation. Because compilation time increases with the number of lines of code, including reusable functions that a script does not use can negatively affect performance.

To mitigate this issue, enable the Don’t attach Groovy libs by default option in Collibra Console. For scripts that require reusable functions, explicitly add a // #importFile statement at the beginning of the script to load the relevant files:

// #importFile resourcePrinter.groovy
// #importFile processDetailsPrinter.groovy

Rules for using #importFile:

The #importFile statements must come first in the script, even before the imports section. Any whitespace character at the beginning of the script is ignored.
Whitespace characters are allowed before and after //.
There cannot be any whitespace character between # and importFile.
Whitespace characters are allowed between #importFile and the Groovy file name.
If a referenced Groovy file is not found in the groovy-lib folder, it is silently ignored.
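The whitespace rules above can be approximated with a regular expression, which may help when checking scripts before deployment. This is a sketch derived from the rules as stated, not Collibra's actual parser. It assumes that at least one whitespace character separates #importFile from the file name and that file names contain no spaces. The names IMPORT_FILE and importedFile are hypothetical.

```groovy
import java.util.regex.Pattern

// Assumed pattern: optional whitespace, "//", optional whitespace,
// "#importFile" with no gap after "#", whitespace, then the file name.
def IMPORT_FILE = Pattern.compile('^\\s*//\\s*#importFile\\s+(\\S+)\\s*$')

// Returns the referenced file name, or null if the line is not a valid statement.
def importedFile = { String line ->
    def m = IMPORT_FILE.matcher(line)
    m.matches() ? m.group(1) : null
}

println importedFile('// #importFile resourcePrinter.groovy')
println importedFile('  //#importFile   processDetailsPrinter.groovy')
println importedFile('// # importFile resourcePrinter.groovy')
```

The first two lines satisfy the rules and yield the file name. The third has whitespace between # and importFile, so it is not recognized.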

Conclusion

Script tasks can significantly enhance workflow automation in Collibra, but their performance depends on multiple factors, including script complexity, memory usage, and compilation time. By following best practices, such as optimizing loops, using asynchronous mode where appropriate, and managing the groovy-lib mechanism effectively, you can improve workflow performance and ensure smoother execution in production environments.