Hans Tuscher
Hans Tuscher

Syncing millions of products to Shopware

Published at Aug 15th 2024
  CTO @ Onedrop
  CTO
  Onedrop
   Regensburg, Germany
 

What are the main challenges in order to import a large catalog with millions of products, categories, languages to Shopware?

There are two primary concerns to address regarding the integration process.

Firstly, it's important to note that Shopware operates exclusively with UUIDs for identifying products. Directly importing products using only a unique SKU is not possible. Should the ID field be left blank, Shopware's API automatically generates a random UUID. This becomes problematic when attempting to update a product by its SKU later, as the operation requires the product's UUID, which may not be initially known. To circumvent this, one could either develop a system to track these UUIDs or perform a lookup for each SKU before synchronization to retrieve the corresponding UUID. However, the latter approach can significantly slow down the synchronization process due to the high volume of requests it generates. A practical workaround is to use a hash function like MD5 on the SKU, as Shopware's hex representation of UUIDs coincidentally aligns with the MD5 format, comprising 32 alphanumeric characters. This method ensures a consistent and identifiable UUID for each product, both during creation and updates. It's worth noting that Shopware is aware of this limitation and is considering enhancements through a new catalog import API, as detailed in an open RFC: Shopware RFC on new catalog import API.

Secondly, the error handling within the synchronization API requires careful consideration. One critical aspect is that a sync request operates as a single database transaction. Consequently, if a batch of 100 items contains even one item with an error, none of the items will be stored. To address this, we implemented a system to identify and remove the erroneous item from the batch and retry the request, preventing the loss of the remaining 99 items. Additionally, the API may return vague or misleading error messages, such as "Malformed UTF-8" due to unknown UUID references, or it may not provide a clear error message at all, instead showing a generic SQL error like An exception occurred while executing 'INSERT INTO ... SQLSTATE[23000]: Integrity constraint violation. These errors can be challenging to interpret and handle effectively. It would be beneficial for Shopware to enhance its error processing to capture and present these errors more clearly.

What are good policies and strategies to keep data synced, in case they come from an external source?

Which strategy is best depends heavily on the requirements, the amount of data and the preconditions of the external source.

1) Regularly Scheduled Syncs: Establish a regular schedule for data synchronization. The frequency should be based on the volatility of the data in the external system and the operational requirements of the Shopware 6 store. For example, stock levels might need to be updated more frequently than product descriptions.

2) Event-Driven Updates: In addition to scheduled syncs, implement event-driven updates for critical data changes. This ensures that important changes, such as price adjustments or stock level changes, are reflected in Shopware 6 almost immediately, enhancing the accuracy of the data presented to the end-users.

3) Incremental Updates: Instead of syncing the entire catalog at every update, employ an incremental update strategy. This means only data that has been added, modified, or deleted since the last synchronization is transferred. Incremental updates reduce the load on both the external system and Shopware 6, improving performance and reducing the risk of errors.

You should also consider the peak hours of your system and exclude them from any sync processes. The user experience and performance of the shop is (in most cases) more important than keeping everything up to date.

What are the things that should never be done while dealing with large catalogs?

When using the Sync-API for transferring large volumes of data to Shopware, it's crucial not to overlook the inclusion of the use-queue-indexing header. Omitting this header means that the API waits for both the data storage and the completion of all indexing processes before signaling success. This requirement significantly extends the response time for each request, from the typical 1-2 seconds to 30-60 seconds, thereby considerably slowing down batch processing. Indexing is more efficiently handled as an asynchronous operation. In certain scenarios, it may be beneficial to entirely bypass indexing during bulk uploads by using the disable-indexing header. Indexing can then be manually initiated afterward through the /api/_action/index endpoint. Adopting this approach enhances control over resource utilization during extensive data operations, ensuring a more efficient management of the system's performance.