Upload and Download Excel Files in CKAN with XLoader

For the past few months, the web development team at January Advisors have been collaborating with a public sector client to build a new open data portal. We used CKAN, which is the world’s leading open-source data management system, and widely used by many state and local governments in the United States.

Most of the features that CKAN offers are based on extensions or plugins. For instance, to store data in the database, it uses the datastore plugin. And in order to customize the site user interface, we had to create a new extension.

Originally, storing Excel files on the portal seemed like a simple requirement. But it was a little more complex than we thought. So if you’re struggling with this, I wrote a blog for you.

Why DataPusher fell short

Processing an Excel file means that CKAN parses the data and saves it to the datastore. This task can be completed by two different tools: DataPusher and XLoader. Both services automatically parse and pull the data from the file and push them into CKAN’s datastore.

DataPusher was the original tool created to push data into the datastore. Unlike many CKAN tools, DataPusher is not an extension but its own standalone service that runs separately from the CKAN application. Although there is a DataPusher extension in CKAN, this is a simple communicating extension with the DataPusher API.

When we used DataPusher initially, it presented a few issues. First, it was slow to load data. This is caused by the many steps that occur before parsing the data into PostgreSQL. DataPusher first parses CSV rows, converts to detected column types, then it converts the data to a JSON string, and finally calls an internal function for each batch of rows to reformat them into PostgreSQL INSERT statements.

Another issue we faced was the complexity of DataPusher. For example, it uses ckan-service-provider, a queuing system that is complicated and stores jobs in its own database. DataPusher also uses a separate web server which adds complexity to overall setup and management.

Perhaps most importantly, DataPusher’s automated data type casting can be unreliable due to its design. It guesses the column type by at the values on the first few rows of a dataset. So if a column contains numeric values in the first few rows but there is an empty cell later, this will cause an error in DataPusher, and data will stop loading.

Getting started with XLoader and Excel

We started looking around for better solutions, and came across XLoader. This tool offers solutions to the previously mentioned issues by piping the datasets directly into PostgreSQL using COPY, running in the CKAN server and querying jobs with RQ, as well as loading all data as text and allowing it to convert the column types manually.

Here’s the catch: XLoader doesn’t work with Excel files. So our team had to design a few workarounds.

To allow Excel files to be uploaded to the datastore using XLoader, the file loader.py in the XLoader extension needs to be updated. Most likely this file will be in the following directory: src/ckanext-xloader/ckanext/xloader/

The easiest way to allow Excel uploads is to inject some code in this Python file that checks if the data file’s extension is xlsx and then use the Pandas library together with the Openpyxl engine to convert the Excel file into a CSV file and reuse the existing code that allows CSV uploads.

First, make sure Pandas and Openpyxl are installed:

pip install pandas
pip install openpyxl

After, open the loader.py file and locate the load_table function.

Finally, add the following if statement before the file opens:

if ct == 'XLSX' or format == '.xlsx':
    excel_file = pd.read_excel(table_filepath, engine='openpyxl')   
    table_filepath = table_filepath.split('.')[0] + '.csv'
    excel_file.to_csv(table_filepath, index=None, header=True)
    ct, format = 'CSV', '.csv'

The load_table function should look like this now:

def load_table(table_filepath, resource_id, mimetype='text/csv', logger=None):
  '''Loads an Excel file (or other tabular data recognized by messytables)
     into Datastore and creates indexes.
  '''
 
   # use messytables to determine the header row
   logger.info('Determining column names and types')
   ct = mimetype # XLSX
   format = os.path.splitext(table_filepath)[1]  # filename extension (e.g, .xlsx)
 
   if ct == 'XLSX' or format == '.xlsx':
      excel_file = pd.read_excel(table_filepath, engine='openpyxl')   
      table_filepath = table_filepath.split('.')[0] + '.csv'
      excel_file.to_csv(table_filepath, index=None, header=True)
      ct, format = 'CSV', '.csv'
 
   with open(table_filepath, 'rb') as tmp:
      …

Xloader Excel downloads

To download Excel files in CKAN, it is necessary to add this option on the resource page Download dropdown button as well as add that functionality on a couple datastore files.

First, navigate to the …/package/resource_read.html file within your current user interface CKAN extension and find the dropdown-menu class within a ul HTML tag. There you will find all the format options that are currently displayed in the Download dropdown button. To add an Excel option, insert this new anchor:

<a href="{{h.url_for('datastore.dump', resource_id=res.id, format='xlsx', bom=True)}}"  target="_blank"><span>XLSX</span></a> 

After this addition the list should look something like this:

<ul class="dropdown-menu">
  <li>
     <a href="{{h.url_for('datastore.dump', resource_id=res.id, format='csv', bom=True)}}" target="_blank">
        <span>CSV</span>
     </a>
     <a href="{{h.url_for('datastore.dump', resource_id=res.id, format='xlsx', bom=True)}}" target="_blank">
        <span>XLSX</span>
     </a>
     <a href="{{h.url_for('datastore.dump', resource_id=res.id, format='tsv', bom=True)}}" target="_blank">
        <span>TSV</span>
     </a>
     <a href="{{h.url_for('datastore.dump', resource_id=res.id, format='json')}}" target="_blank">
        <span>JSON</span>
     </a>
     <a href="{{h.url_for('datastore.dump', resource_id=res.id, format='xml')}}" target="_blank">
        <span>XML</span>
     </a>
  </li>
</ul>

Now that the Excel download option exists in the resource page, it is necessary to add this functionality in the datastore files. Navigate to the datastore extension directory …/src/ckan/ckanext/datastore. Before editing a few files in this directory, it is important to confirm that the Python library openpyxl is installed. If you have followed the instructions to upload an Excel file you have already installed this library but if you have not, please make sure you are in your virtual environment and run:

pip install openpyxl

Once openpyxl is installed, open the writer.py file and add the following libraries at the beginning of the file:

import csv
from tempfile import NamedTemporaryFile
from openxyl import Workbook

Now, let’s add an Excel writer function and a class similar to the function/class used for downloading the file in CSV, XML and JSON formats:

@contextmanager
def xlsx_writer(response, fields, name=None, bom=False):
   u'''Context manager for writing UTF-8 XLSX data to response
       :param response: file-file or response-like object for writing data  and headers
       :param fields: list of datastore fields
       :param name: file name (for headers, response-like objects only)
       :param bom: True to include a UTF-8 BOM at the start of the file
   '''
   if hasattr(response, u'headers'):
  	response.headers['Content-disposition'] = b'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet; charset=utf-8'
   	if name:
       	    response.headers['Content-disposition'] = (u'attachment; filename="{name}.xlsx"'.format(name=encode_rfc2231(name)))
   
   yield ExcelWriter(response.stream, fields)
 
 
class ExcelWriter(object):
    def __init__(self, response, fields):
    	self.response = response
    	self.wb = Workbook()
 
    	ws = self.wb.active
    	self.fields = [f['id'] for f in fields]
    	ws.append(self.fields)
 
    def write_records(self, records):
    	ws = self.wb.active
    	for r in records:
            ws.append(r[c] for c in self.fields)
 
    def save_records(self):
    	with NamedTemporaryFile() as tmp:
            self.wb.save(tmp.name)
            tmp.seek(0)
            self.response.write(tmp.read())

Once the Excel writer function and class has been included in the writer.py file, let’s go back to the datastore directory and this time open the blueprint.py file. Add the new xlsx_writer function in the ckanext.datastore.writer imports and also include xlsx in a DUMP_FORMATS

from ckanext.datastore.writer import (
    csv_writer,
    xlsx_writer,
    tsv_writer,
    json_writer,
    xml_writer,
)
...
DUMP_FORMATS = u'csv', u'tsv', u'json', u'xml', u'xlsx'
...

After this, add the if fmt == u’csv’ statement in the function dump_to:

def dump_to(resource_id, ouput, fmt, offset, limit, options, sort, search_params):
    if fmt == u'csv':
        write_factory = csv_writer
    	records_format = u'csv'
    elif fmt == u'xlsx':
    	write_factory = xlsx_writer
    	records_format = u'objects'
    elif fmt == u'tsv':
    	write_factory = tsv_writer
    	records_format = u'tsv'
    elif fmt == u'json':
    	write_factory = json_writer
    	records_format = u'lists'
    elif fmt == u'xml':
    	writer_factory = xml_writer
    	records_format = u'objects'
    ...

Finally, add the if fmt == u’xlsx’ statement to save the Excel records after the while True within start_writer:

with start_writer(result[u'fields']) as wr:
   while True:
   	if limit is not None and limit <= 0:
       	    break
   	records = result[u'records']
   	wr.write_records(records)
   	if records_format == u'objects' or records_format == u'lists':
            if len(records) < paginate_by:
            	break
   	elif not records:
            break
 
   	offset += paginate_by
   	if limit is not None:
            limit -= paginate_by
            if limit <= 0:
           	break
     
   	result = result_page(offset, limit)
 
   if fmt == u'xlsx':
        wr.save_records()
 
datastore.add_url_rule(u'/datastore/dump/<resource_id>', view_func=dump)
datastore.add_url_rule(u'/dataset/<id>/dictionary/<resource_id>', view_func=DictionaryView.as_view(str(u'dictionary')))

After all these changes, the Excel download functionality is ready and users can choose to download this format in the resource file.

Sources

https://tech.datopian.com/load/#what-issues-are-there

https://github.com/ckan/datapusher

https://github.com/ckan/ckanext-xloader

https://github.com/ckan/ckanext-xloader#key-differences-from-datapusher

Gerard Otalora

Gerard is a web developer at January Advisors that specializes in user experience, web interactivity and design. You can find his full bio on LinkedIn