Symbol PDF Storage

December 2, 2024

In Symbol, it's possible to store a PDF document on chain and then randomly access its pages. In this article, we'll build a proof of concept that illustrates how this could work. To keep the focus on Symbol usage, we will make some simplifications in the handling of PDF documents. The trade-off of these simplifications is additional data bloat written to the blockchain.

Before continuing, please review this article about sending Symbol transactions with the Symbol SDK. We'll be using its prepare_and_send_transaction and get_transaction_by_hash functions to simplify the examples here.
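
For reference, here is a minimal sketch of the signatures we assume for those two helpers, inferred from how they are used later in this article; the actual implementations live in the article linked above:

async def prepare_and_send_transaction(facade, signer_key_pair, transaction_descriptor, verbose=True):
    # signs and announces the transaction described by transaction_descriptor,
    # waits for it to confirm and returns its transaction hash
    ...

async def get_transaction_by_hash(transaction_hash):
    # looks up a confirmed transaction via a REST node and returns its JSON
    ...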

Conceptual Design

A PDF document is composed of objects linked to each other both directly and indirectly. These objects can include text, images, fonts, etc. A PDF parser extracts all of the objects, and a PDF viewer stitches them together into viewable pages.

There are some shared objects - like an embedded font - that are used across all pages. Other objects - like an image - are scoped to a single page. With a purpose-built parser (or library), it would be possible to split up shared and page-scoped objects. After downloading the shared and page-scoped objects for a single page, it would be possible to reconstitute the page exactly.

We can identify and group objects based on how they are used: one group of shared objects and one group of page-scoped objects per page. We can store these object groups separately on the Symbol blockchain. Since the size of this data can be large, we will need to break it up into 1024-byte chunks spread across embedded transfer transactions contained within one or more aggregate transactions. In addition, we will need to create a table of contents that is stored within its own aggregate. It will contain a mapping from a textual key to the hashes of the aggregate transactions containing that key's data.

In this article, we make two simplifications to the solution discussed above:

  1. We assume that only shared objects will remain after importing a single page from a PDF document into a new PDF document and clearing that page's contents. In fact, some other cruft objects will remain in the document, so the resulting document's size will be larger.
  2. We assume that a page's contents contains only page-scoped objects. In fact, the contents will include some indirect shared objects.

All code was tested to work with the Bitcoin whitepaper. Most of it should be generally applicable to other PDFs, but some of it might not.

With that in mind, let's begin storing a PDF. We will be using pypdf to read and write PDF documents.
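
The snippets below assume a handful of imports; here is a minimal sketch (the symbolchain path is the Python Symbol SDK import, the rest are standard library or pypdf):

import asyncio
import io
from binascii import unhexlify

from pypdf import PdfReader, PdfWriter
from symbolchain.facade.SymbolFacade import SymbolFacade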

Storing a PDF

PDF Data Extraction

Given a PDF document, we can create a PDF template that will contain all shared objects. First, we create a blank PDF using PdfWriter, and then add the first page from the original PDF. This will import all shared objects and page-one scoped objects from the original PDF. Next, since the contents of the first page are unnecessary, we clear them. Finally, we serialize the new PDF into an in-memory stream and return the backing bytes buffer.

⚠️ Simply clearing the first page's contents will not remove page-scoped objects. As a result, the template will contain some superfluous objects.

def create_pdf_template(original_pdf):
    # create pdf with a blank page by adding the first page and then clearing its contents
    # this will serve as a template when recreating pdfs
    # (this is the page specific content for bitcoin.pdf and may not generalize to all pdfs)
    template_pdf = PdfWriter()
    template_pdf.add_page(original_pdf.pages[0])
    template_pdf.pages[0]['/Contents'].set_data(bytes())  # pylint: disable=unsubscriptable-object

    memory_stream = io.BytesIO()
    template_pdf.write(memory_stream)
    return memory_stream.getvalue()

Given a page from a PDF document, we can create a PDF summary that will contain all page-scoped objects. Similarly to creating a template, we create a blank PDF using PdfWriter and then add the page from the original PDF. Finally, we retrieve bytes representing the page content.

⚠️ The page contents will contain instantiated indirect references, which are typically shared objects. As a result, the summary will contain some duplicate objects.

def extract_page_content(original_pdf_page):
    # write the page to a new pdf
    memory_stream = io.BytesIO()
    single_page_pdf = PdfWriter()
    single_page_pdf.add_page(original_pdf_page)
    single_page_pdf.write(memory_stream)

    # extract the page contents object
    # (this is the page specific content for bitcoin.pdf and may not generalize to all pdfs)
    return single_page_pdf.pages[0]['/Contents'].get_data()  # pylint: disable=unsubscriptable-object

Symbol Data Storage

Symbol (mainnet) limits the number of message bytes that can be stored in a transfer transaction to 1024 bytes (MESSAGE_CHUNK_SIZE). In addition, Symbol limits the number of embedded transactions stored in an aggregate transaction to 100 (MAX_TRANSACTIONS_PER_AGGREGATE). The amount of data we want to store in the blockchain is unknown and could be large. We need to be able to store arbitrarily sized data in the Symbol blockchain.
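
Expressed as constants (values taken from the limits above; these are mainnet defaults, so adjust them if your target network is configured differently):

MESSAGE_CHUNK_SIZE = 1024
MAX_TRANSACTIONS_PER_AGGREGATE = 100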

In order to accomplish this, we first take an arbitrary message and split it into chunks no larger than MESSAGE_CHUNK_SIZE. Each chunk is wrapped in an embedded transfer transaction:

def chunk_data_into_transactions(facade, signer_key_pair, recipient_address, message):
    message_index = 0
    embedded_transactions = []
    while message_index < len(message):
        embedded_transaction = facade.transaction_factory.create_embedded({
            'type': 'transfer_transaction_v1',
            'signer_public_key': signer_key_pair.public_key,
            'recipient_address': recipient_address,
            'mosaics': [],
            'message': message[message_index:message_index + MESSAGE_CHUNK_SIZE]
        })
        embedded_transactions.append(embedded_transaction)

        message_index += MESSAGE_CHUNK_SIZE

    return embedded_transactions

Next, we take the resulting embedded transfer transactions and batch them into groups of at most MAX_TRANSACTIONS_PER_AGGREGATE. Each group is wrapped in an aggregate (complete) transaction:

async def send_aggregate_transactions(facade, signer_key_pair, embedded_transactions):
    transaction_index = 0
    transaction_hash_tasks = []
    while transaction_index < len(embedded_transactions):
        subset_embedded_transactions = embedded_transactions[transaction_index:transaction_index + MAX_TRANSACTIONS_PER_AGGREGATE]
        merkle_hash = facade.hash_embedded_transactions(subset_embedded_transactions)

        transaction_hash_task = prepare_and_send_transaction(facade, signer_key_pair, {
            'type': 'aggregate_complete_transaction_v2',
            'signer_public_key': signer_key_pair.public_key,
            'transactions_hash': merkle_hash,
            'transactions': subset_embedded_transactions
        }, verbose=False)
        transaction_hash_tasks.append(transaction_hash_task)

        transaction_index += MAX_TRANSACTIONS_PER_AGGREGATE

    return await asyncio.gather(*transaction_hash_tasks)

As an optimization, we call prepare_and_send_transaction asynchronously and store the resulting tasks. We use asyncio.gather to wait for all the tasks to complete. Since prepare_and_send_transaction returns a transaction hash, this function returns an array of transaction hashes.

Finally, putting everything together, we can write a small helper function that will store an arbitrary blob on Symbol:

async def store_blob(facade, signer_key_pair, recipient_address, message):
    print('splitting data into multiple chunks')
    embedded_transactions = chunk_data_into_transactions(facade, signer_key_pair, recipient_address, message)

    print(f'constructing and sending aggregate with {len(embedded_transactions)} embedded transactions to network')
    return await send_aggregate_transactions(facade, signer_key_pair, embedded_transactions)

Table of Contents

In order to support recreating an entire or partial PDF document, we will build a table of contents during the processing of the original PDF. This table of contents is a text file that will be stored within one or more aggregate complete transactions.

Each line of the text file is in the format (template|page\d+): hash(,hash)*. The part left of the colon is the key: template indicates an entry for the shared template, while page\d+ indicates an entry for a page-scoped summary. The part right of the colon is a comma-delimited list of the hashes of the aggregate complete transactions that contain the relevant data.

The following function can be used to create a table of contents line:

def make_toc_entry(tag, transaction_hashes):
    return f'{tag}: {",".join(str(transaction_hash) for transaction_hash in transaction_hashes)}'
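
For example, given two hypothetical (shortened) hashes, an entry for the first page would render like this:

# hypothetical, shortened hashes for illustration
make_toc_entry('page1', ['07C1...', '9A2F...'])
# -> 'page1: 07C1...,9A2F...'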

Putting it All Together

First, we need to open the original PDF document using PdfReader:

async def store_pdf_impl(facade, signer_key_pair, recipient_address, pdf_filepath):  # pylint: disable=too-many-locals
    # load the pdf
    toc_lines = []
    with open(pdf_filepath, 'rb') as infile:
        inpdf = PdfReader(infile)

Second, we create the template from the original PDF document, store it in the blockchain and add a table of contents entry:

        # create and store a template
        pdf_template_bytes = create_pdf_template(inpdf)
        template_transaction_hashes = await store_blob(facade, signer_key_pair, recipient_address, pdf_template_bytes)
        toc_lines.append(make_toc_entry('template', template_transaction_hashes))

Third, we process each page of the original PDF document individually. We create a summary for each page, and store it in the blockchain asynchronously. Once all the pages have been stored, we add a table of contents entry for each page:

        # create and store a page part for each page
        page_tasks = []
        for pdf_page in inpdf.pages:
            page_content_bytes = extract_page_content(pdf_page)
            page_tasks.append(store_blob(facade, signer_key_pair, recipient_address, page_content_bytes))

        page_transaction_hashes_groups = await asyncio.gather(*page_tasks)
        for i, page_transaction_hashes in enumerate(page_transaction_hashes_groups):
            toc_lines.append(make_toc_entry(f'page{i+1}', page_transaction_hashes))

Finally, we build the table of contents, and store it in the blockchain:

        toc_contents = '\n'.join(toc_lines)
        toc_transaction_hashes = await store_blob(facade, signer_key_pair, recipient_address, toc_contents.encode('utf8'))
        print(f'stored table of contents as{make_toc_entry("", toc_transaction_hashes)}')
        print(f'***\n{toc_contents}\n***')

Importantly, the toc_transaction_hashes will be needed to retrieve the PDF document, or any part of it, from the blockchain. In practice, except for very large PDF documents, this list will contain only a single value.
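
As a rough sanity check: a single aggregate can hold up to MAX_TRANSACTIONS_PER_AGGREGATE × MESSAGE_CHUNK_SIZE = 100 × 1024 = 102,400 bytes, and each table of contents line costs about 10 bytes for the key plus 65 bytes per referenced hash. A document would need on the order of a thousand pages before its table of contents spilled into a second aggregate.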

For completeness, the store_pdf_impl helper function can be called as follows:

async def store_pdf():
    facade = SymbolFacade('testnet')
    (signer_key_pair, signer_address) = get_test_account(facade, 1, 'signer')
    await store_pdf_impl(facade, signer_key_pair, signer_address, TEST_PDF_FILEPATH)

Notice that the sender and recipient both refer to the same account. Since there is no value transfer, this is reasonable, but not required.

Retrieving a PDF

Symbol Data Retrieval

Given the transaction hashes of one or more aggregate transactions, we can reconstitute the data stored within them:

async def load_message_contents(transaction_hashes):
    download_tasks = [get_transaction_by_hash(transaction_hash) for transaction_hash in transaction_hashes]
    transaction_jsons = await asyncio.gather(*download_tasks)

    message = bytearray()
    for transaction_json in transaction_jsons:
        for embedded_transaction_json in transaction_json['transaction']['transactions']:
            message += unhexlify(embedded_transaction_json['transaction']['message'])

    return bytes(message)

First, we download all of the transactions (by hash) asynchronously using get_transaction_by_hash. We wait for all the downloads to complete using asyncio.gather. Next, we loop over every embedded transaction in every aggregate transaction. For simplicity, we assume only embedded transfer transactions are present. Finally, we merge the message contents of every embedded transfer transaction. We need to use unhexlify because the REST API returns the message data as a hex-encoded string.

Loading the Table of Contents

In order to rebuild any part of the stored PDF document, we first need to rebuild the table of contents. We can do this given the hashes of the aggregate transactions that store it:

async def load_toc(toc_transaction_hashes):
    toc_bytes = await load_message_contents(toc_transaction_hashes)

    toc = {}
    for line in toc_bytes.decode('utf8').split('\n'):
        [key, transaction_hashes] = line.split(':')
        toc[key] = [transaction_hash.strip() for transaction_hash in transaction_hashes.split(',')]

    return toc

First, we download the table of contents stored in the blockchain using load_message_contents. Next, we split the downloaded data into lines because each line corresponds to a single table of contents entry. Finally, we parse each line and store the results in an object with string keys and array (of transaction hashes) values.

PDF Construction

We can rebuild a PDF page given the PDF template and a PDF page summary. The PDF template contains all shared objects, and the PDF page summary contains all page-scoped objects:

async def retrieve_single_page_pdf(toc, template_bytes, page_number):
    page_bytes = await load_message_contents(toc[f'page{page_number + 1}'])

    output_pdf = PdfWriter(io.BytesIO(template_bytes))
    output_pdf.pages[0]['/Contents'].set_data(page_bytes)  # pylint: disable=unsubscriptable-object
    return output_pdf

First, we use the table of contents to look up the hashes of the aggregate transactions that contain the PDF page summary. Note that the table of contents entries are 1-based, while page_number is 0-based. Once we have those transaction hashes, we use load_message_contents to download the PDF page summary. Next, in order to create the single page PDF, we create a new PdfWriter around the PDF template. Finally, we replace its single page's contents with the PDF page summary.

Putting it All Together

First, we need to download and build the table of contents object:

async def retrieve_pdf():
    print('downloading table of contents')
    toc = await load_toc([TEST_TOC_HASH])
    print(toc)

Second, we need to download the PDF template:

    print('downloading template')
    template_bytes = await load_message_contents(toc['template'])

Finally, we can recover a single PDF page and save it to disk by calling retrieve_single_page:

    async def retrieve_single_page(page_number):
        output_pdf = await retrieve_single_page_pdf(toc, template_bytes, page_number)

        with open(SINGLE_PAGE_OUTPUT, 'wb') as outfile:
            output_pdf.write(outfile)

    await retrieve_single_page(2)

Alternatively, we could recover the entire PDF document or a subset of pages:

    async def retrieve_multiple_pages(page_numbers):
        # load all pages
        download_tasks = [retrieve_single_page_pdf(toc, template_bytes, page_number) for page_number in page_numbers]
        single_page_pdfs = await asyncio.gather(*download_tasks)

        output_pdf = single_page_pdfs[0]
        for single_page_pdf in single_page_pdfs[1:]:
            output_pdf.add_page(single_page_pdf.pages[0])  # pylint: disable=unsubscriptable-object

        with open(MULTI_PAGE_OUTPUT, 'wb') as outfile:
            output_pdf.write(outfile)

    await retrieve_multiple_pages(list(range(9)))

Here, we use retrieve_single_page_pdf to construct a single-page PDF for each requested page. Then, we merge all of the single-page PDFs to produce one multi-page PDF.

⚠️ The resulting PDF will likely be (much) larger than the original PDF. This is because add_page does not seem to reuse shared objects across documents. This could be optimized with a more granular PDF reader and writer as discussed in the Conceptual Design.