Erasure Coding

Erasure coding is a method for safeguarding data against partial data loss. The original data is divided into multiple fragments, and extra parity fragments are generated to introduce redundancy. The key advantage of erasure coding is that the complete original data can be recovered even if some fragments are lost. The level of protection against data loss is also configurable, which makes erasure coding a versatile and reliable choice for preserving data integrity on Swarm. For a more in-depth look at erasure coding on Swarm, see the erasure coding paper from the Swarm research team.

Uploading With Erasure Coding

Erasure coding is available for the /bytes and /bzz endpoints, but not for the /chunks endpoint. Erasure coding relies on splitting data into chunks, and the chunk is the smallest unit of data within Swarm and cannot be subdivided further, so it does not apply to the /chunks endpoint, which deals with single chunks.

To upload data to Swarm using erasure coding, the swarm-redundancy-level: <integer> header is used:

    curl \
      -X POST "http://localhost:1633/bzz?name=test.txt" \
      -H "swarm-redundancy-level: 1" \
      -H "swarm-postage-batch-id: 54ba8e39a4f74ccfc7f903121e4d5d0fc40732b19efef5c8894d1f03bdd0f4c5" \
      -H "Content-Type: text/plain" \
      --data-binary @test.txt

{"reference":"c02e7d943fbc0e753540f377853b7181227a83e773870847765143681511c97d"}

The accepted values for the swarm-redundancy-level header range from the default of 0 up to 4. Erasure coding is turned off at 0 and is at its maximum at 4. Each increasing level adds more data redundancy and so offers greater protection against data loss. Each level has been formulated to tolerate a certain percentage of chunk retrieval errors, shown in the table below. As long as the actual error rate is below the expected chunk retrieval error rate for the given level, there is a less than 1 in a million chance of failing to retrieve the source data.

| Redundancy Level | Pseudonym | Expected Chunk Retrieval Error Rate |
| --- | --- | --- |
| 0 | None | 0% |
| 1 | Medium | 1% |
| 2 | Strong | 5% |
| 3 | Insane | 10% |
| 4 | Paranoid | 50% |
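
As a convenience, the mapping in the table above can be wrapped in a small shell helper that turns either a level number or its pseudonym into the upload header. This is only a sketch: the function name `redundancy_header` is hypothetical and is not part of Bee or its tooling.

```shell
#!/bin/sh
# Hypothetical helper (not part of Bee): map a redundancy level, given as a
# number 0-4 or as a pseudonym from the table above, to the upload header line.
redundancy_header() {
  # Lower-case the argument so "Strong" and "strong" are both accepted.
  level=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$level" in
    0|none)     n=0 ;;
    1|medium)   n=1 ;;
    2|strong)   n=2 ;;
    3|insane)   n=3 ;;
    4|paranoid) n=4 ;;
    *)
      echo "invalid redundancy level: $1 (expected 0-4 or a pseudonym)" >&2
      return 1
      ;;
  esac
  printf 'swarm-redundancy-level: %s\n' "$n"
}

# Usage: print the header for the "Strong" level.
redundancy_header Strong   # prints "swarm-redundancy-level: 2"
```

The output can then be passed to curl, for example with `-H "$(redundancy_header strong)"` in place of the literal header in the upload command shown earlier.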

Redundancy Level Costs Explained

Erasure encoding is applied to sets of at most 128 chunks (counting both data chunks and parity chunks). At higher levels of redundancy, the ratio of parity chunks to data chunks increases, which raises the percent cost of the upload compared to lower levels of redundancy.

The table below shows the percent cost for each redundancy level. The cost of encrypted uploads is also shown, and is roughly double the cost of unencrypted uploads.

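| Redundancy | Parities | Data Chunks | Percent | Chunks Encrypted | Percent Encrypted |
| --- | --- | --- | --- | --- | --- |
| Medium | 9 | 119 | 7.6% | 59 | 15% |
| Strong | 21 | 107 | 19.6% | 53 | 40% |
| Insane | 31 | 97 | 32% | 48 | 65% |
| Paranoid | 89 | 37 | 240.5% | 18 | 494% |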

For larger uploads (where the number of source data chunks is equal to or greater than the "Data Chunks" value for the chosen redundancy level), the values in the "Percent" column serve as a general estimate of the percent cost of uploading. The same estimate also holds when the number of chunks is only slightly less than the "Data Chunks" value.

However, if the number of source data chunks is significantly less than the "Data Chunks" value for the chosen level, the percent cost will differ significantly from the one shown in the "Percent" column. For more precise calculations, see the relevant appendix.
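
The "Percent" values in the table are consistent with the ratio of parity chunks to data chunks in a full chunk set. The sketch below reproduces that arithmetic in shell; the function name `overhead_percent` is hypothetical, and the calculation only covers full chunk sets, not the smaller uploads discussed above.

```shell
#!/bin/sh
# Hypothetical helper: percent overhead of parity chunks relative to data
# chunks in one full chunk set, rounded to one decimal place.
overhead_percent() {
  parities=$1
  data_chunks=$2
  awk -v p="$parities" -v d="$data_chunks" 'BEGIN {
    if (d <= 0) {
      print "data chunk count must be positive" > "/dev/stderr"
      exit 1
    }
    printf "%.1f\n", 100 * p / d
  }'
}

# Parities and data chunks taken from the table above.
overhead_percent 9 119    # Medium:   prints 7.6
overhead_percent 21 107   # Strong:   prints 19.6
overhead_percent 31 97    # Insane:   prints 32.0
overhead_percent 89 37    # Paranoid: prints 240.5
```

For example, at the Paranoid level each set of 37 data chunks is accompanied by 89 parity chunks, so the upload costs roughly 240.5% more than the same data without erasure coding.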

Downloading Erasure Encoded Data

For a downloader, downloading a file which has been erasure encoded requires no changes to the normal download process. There are several options for adjusting the default behaviour of erasure encoded downloads, but there is normally no need to adjust them.

Default Download Behaviour

Erasure coding retrieval for downloads is enabled by default, so there is no need for a downloader to explicitly enable the feature. The default download behaviour is to use the DATA strategy with fallback enabled. With these settings, first an attempt will be made to download the data chunks only. If any of the data chunks are missing, then the retrieval method will fall back to the RACE strategy (PROX is not currently implemented and so will be skipped). With the RACE strategy, an attempt will be made to download all data and parity chunks, and chunks will continue to be downloaded until enough have been retrieved to reconstruct the original data.

Options

Warning: Do not adjust these options unless you know exactly what you are doing. The default settings are the best option for almost all cases.

When downloading erasure encoded data, there are three related headers which may be used: swarm-redundancy-strategy, swarm-redundancy-fallback-mode, and swarm-chunk-retrieval-timeout.

  • swarm-redundancy-strategy: This header allows you to set the retrieval strategy for fetching chunks. The accepted values range from 0 to 3. Each number corresponds to a different chunk retrieval strategy. The numbers stand for the NONE, DATA, PROX and RACE strategies respectively which are described in greater detail in the API reference (also see the erasure code paper for even more in-depth descriptions). With each increasing level, there will be a potentially greater bandwidth cost.

    Retrieval Strategies (numbered by header value)
    0. NONE: Data chunks are retrieved directly and sequentially, without pre-fetching. Parity chunks are ignored.
    1. DATA: The same as NONE, except that data chunks are pre-fetched (fetched in parallel in order to reduce latency).
    2. PROX: The chunks closest (in Kademlia distance) to the node are retrieved first. (Not yet implemented.)
    3. RACE: Initiates requests for all data and parity chunks and continues to retrieve chunks until enough have been retrieved to reconstruct the original data.
  • swarm-redundancy-fallback-mode: <boolean>: Enables fallback between the redundancy strategies, so that if one retrieval strategy fails, retrieval falls back to the next more intensive strategy until it either succeeds or no strategies remain. Default is true.

  • swarm-chunk-retrieval-timeout: <duration>: Allows you to specify the timeout for chunk retrieval, with a default value of 30 seconds. (This is primarily used by the Bee development team for testing, and most Bee users should not need this option.)
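
To make the interaction between the strategy and the fallback option concrete, the sketch below lists the strategies that would be attempted, in order, for a given starting strategy. It is an illustrative model based only on the descriptions above, not Bee's implementation, and the function name `strategy_order` is hypothetical. It assumes that fallback always moves to the next higher strategy number and that PROX is skipped because it is not implemented.

```shell
#!/bin/sh
# Illustrative model only (not Bee's implementation): print the strategies
# attempted, in order, for a starting strategy and a fallback setting.
strategy_order() {
  start=$1      # 0=NONE, 1=DATA, 2=PROX, 3=RACE
  fallback=$2   # "true" or "false"
  case "$start" in
    0|1|2|3) ;;
    *) echo "invalid strategy: $start (expected 0-3)" >&2; return 1 ;;
  esac
  order=""
  s=$start
  while [ "$s" -le 3 ]; do
    case "$s" in
      0) name=NONE ;;
      1) name=DATA ;;
      2) name="" ;;   # PROX is not implemented, so this model skips it
      3) name=RACE ;;
    esac
    if [ -n "$name" ]; then
      order="${order:+$order }$name"
    fi
    # Without fallback, only the starting strategy is attempted.
    [ "$fallback" = "true" ] || break
    s=$((s + 1))
  done
  echo "$order"
}

strategy_order 1 true    # default settings: prints "DATA RACE"
strategy_order 1 false   # fallback disabled: prints "DATA"
```

With the default settings (DATA with fallback enabled), the model yields DATA followed by RACE, which matches the default download behaviour described earlier.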

An example download request may look something like this:

    curl -OJL \
      -H "swarm-redundancy-strategy: 3" \
      -H "swarm-redundancy-fallback-mode: true" \
      http://localhost:1633/bzz/c02e7d943fbc0e753540f377853b7181227a83e773870847765143681511c97d/

% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0

For this request, the redundancy strategy is set to 3 (RACE), which means that it will initiate a request for all data and parity chunks and continue to retrieve chunks until enough have been retrieved to reconstruct the source data. This is in contrast with the default strategy of DATA where only the data chunks will be retrieved.

However, as noted above, it is recommended not to adjust the default settings for these options, so a typical request would actually look like this (exactly the same as a normal download with no additional options set):

    curl -OJL http://localhost:1633/bzz/c02e7d943fbc0e753540f377853b7181227a83e773870847765143681511c97d/

% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0

This means that there is no need for you to inform downloaders that a file you have uploaded uses erasure coding, as even with the default download behaviour reconstruction of the source file will be attempted if any chunks are missing.