Shapefile to GeoJSON Converter

Overview

Shapefile to GeoJSON Converter allows users to convert shapefiles to the GeoJSON interchange format, which is more practical and convenient for sharing, visualizing, and processing simple geographical data.

Structure

The Shapefile to GeoJSON Converter operates as a pipeline that monitors an input stream layer for new conversion requests and processes them on receipt.

Figure 1. Application Flow Diagram

External Application

External to the conversion pipeline, you will need some means of generating conversion requests and publishing them to the request stream. In a production workflow, you will likely have one or more applications that write to the request stream. For convenience, we have included a script in the Shapefile to GeoJSON Converter distribution bundle that will perform the following:

  • Publish an input shapefile residing on your local drive to a Versioned layer
  • Create a conversion request which uses this shapefile as input
  • Publish the request to the input stream layer so that it will be processed by the pipeline template

See the separate README file in the utils folder for details.
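
For orientation, the three steps above amount to a small producer loop. The sketch below is illustrative only: the two helper functions are hypothetical placeholders for whatever mechanism you use to upload blobs and publish stream messages (for example, the HERE Data SDK or the bundled script), and all HRNs, layer ids, and filenames are placeholders. The script in the utils folder is the supported implementation.

# Illustrative sketch only. upload_to_versioned_layer and publish_to_stream
# are hypothetical placeholders, NOT part of any actual platform API.
import json
import uuid

def upload_to_versioned_layer(catalog_hrn, layer_id, zip_path):
    # Placeholder: upload the .zip blob to a Versioned layer and return
    # the partition id it was published under.
    print(f"would upload {zip_path} to {catalog_hrn}/{layer_id}")
    return "roads"  # placeholder partition id

def publish_to_stream(catalog_hrn, layer_id, message):
    # Placeholder: write one JSON message to the request stream layer.
    print(f"would publish to {catalog_hrn}/{layer_id}: {message}")

# Step 1: publish the local shapefile .zip to a Versioned layer.
partition = upload_to_versioned_layer(
    "hrn:here:data::org:shapefile-input", "shapefiles", "roads.zip")

# Step 2: build a conversion request pointing at the uploaded shapefile.
request = {
    "id": str(uuid.uuid4()),  # optional; the pipeline generates one if absent
    "input": {
        "catalog": "hrn:here:data::org:shapefile-input",
        "layer": "shapefiles",
        "partition": partition,
        "layerTypeName": "Versioned",
    },
    "output": {"catalog": "hrn:here:data::org:geojson-output", "tileLevel": 12},
}

# Step 3: publish the request to the input stream layer.
publish_to_stream("hrn:here:data::org:requests", "conversion-requests",
                  json.dumps(request))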

Request

Requests are small JSON objects that contain the information the pipeline needs to perform a conversion. The request structure is:

{
    "id": "String",
    "input":
    {
        "catalog": "hrn format, String",
        "layer": "String",
        "partition": "String",
        "layerTypeName": "LayerTypeName enum (Versioned, Volatile)",
        "version": Long
    },
    "output":
    {
        "catalog": "hrn format, String",
        "tileLevel": Integer
    }
}

The id element is optional. If a value is not given, the pipeline will generate one for you.

The input object is where you specify the location of the input shapefile:

  • catalog: HRN of the catalog where the shapefile resides
  • layer: id of the layer in the above catalog where the shapefile resides
  • partition: (optional) partition id where the shapefile resides
  • layerTypeName: type of layer the shapefile is stored in. Allowed types are Versioned and Volatile
  • version: (optional) if layerTypeName is "Versioned", you may specify which version to use. If this value is not given, the latest version is used by default

The output object specifies where the GeoJSON output should be stored:

  • catalog: HRN of the catalog where the output is to be written
  • tileLevel: (optional) if specified, GeoJSON output will be published to the partition(s) corresponding to this level. A feature that spans multiple partitions will be published to the partition containing its southwestern-most point. Valid tile levels are integers from 1 to 16. If not specified, a level will be chosen based on the spatial extent of your input shapefile, and all output will be published to a single partition at that level.
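
For example, a complete request converting version 3 of a shapefile stored in a Versioned layer, with output tiled at level 12, might look like this (the HRNs, ids, and values are placeholders):

{
    "id": "roads-conversion-001",
    "input":
    {
        "catalog": "hrn:here:data::org:shapefile-input",
        "layer": "shapefiles",
        "partition": "roads",
        "layerTypeName": "Versioned",
        "version": 3
    },
    "output":
    {
        "catalog": "hrn:here:data::org:geojson-output",
        "tileLevel": 12
    }
}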

Request Stream

This is the input stream layer to which your requests will be written. The conversion pipeline reads from this layer and processes the requests. You will identify the stream layer to be used when you run the Wizard script.

Shapefile Input

  • Only shapefiles complying with the "ESRI Shapefile Technical Description" will be accepted (a pre-flight check for these rules is sketched after this list).
    • The shapefile must be in the form of a .zip archive (the layer where you store it should therefore use Content type: application/zip).
    • All files contained in the .zip must reside at the root (not in subdirectories).
    • A main file (.shp), index file (.shx), and dBASE table (.dbf) must be present, and all must share the same basename. Other file types (.prj, for example) are optional but should also use the same basename.
    • Only one layer per .zip is allowed (all component files must share the same basename).
  • There is a 2 GB size limit for each shapefile component file.
  • If using a Volatile layer for shapefile storage, the maximum size is 2 MB (the volatile storage limit).
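
For reference, the constraints above can be checked locally before you publish an archive. This is a minimal sketch using only the Python standard library; the filename at the end is a placeholder:

# Minimal pre-flight check for a shapefile .zip, based on the rules above.
import os
import zipfile

MAX_COMPONENT_SIZE = 2 * 1024 ** 3  # 2 GB limit per component file

def check_shapefile_zip(path):
    with zipfile.ZipFile(path) as zf:
        infos = zf.infolist()
        # All files must reside at the root of the archive.
        if any("/" in i.filename for i in infos):
            raise ValueError("files must not be in subdirectories")
        # One layer per .zip: every component must share one basename.
        basenames = {os.path.splitext(i.filename)[0] for i in infos}
        if len(basenames) != 1:
            raise ValueError("all component files must share the same basename")
        # .shp, .shx, and .dbf are mandatory; others (.prj, ...) are optional.
        exts = {os.path.splitext(i.filename)[1].lower() for i in infos}
        missing = {".shp", ".shx", ".dbf"} - exts
        if missing:
            raise ValueError(f"missing required components: {missing}")
        # Respect the 2 GB per-component limit (uncompressed size).
        for i in infos:
            if i.file_size > MAX_COMPONENT_SIZE:
                raise ValueError(f"{i.filename} exceeds the 2 GB limit")

check_shapefile_zip("roads.zip")  # placeholder filename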

Conversion Pipeline

Your pipeline will be deployed by the Wizard with configuration parameters determined by answers to the Wizard questions. Once deployed, it will monitor the request stream and perform conversions as new requests come in. See the Verification section of this document for information on checking pipeline status.

GeoJSON Output

For each successful conversion, a new Versioned layer will be added to your output catalog and will contain the resulting GeoJSON data. The layer name and layer id will be composed of the current timestamp and the basename of files contained in the shapefile .zip, formatted as YYYYMMDDHHMMSS-basename. HERETile partitioning will be applied to this new layer with tile level as specified in the Request (or automatically chosen based on input data if not specified).
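
As an illustration, a layer id of this form can be reconstructed as follows (the basename is a placeholder, and the use of local time is an assumption; the pipeline's actual clock source is not specified here):

# Illustrative only: build a layer id of the form YYYYMMDDHHMMSS-basename.
from datetime import datetime

basename = "roads"  # placeholder: basename of the files inside the .zip
layer_id = datetime.now().strftime("%Y%m%d%H%M%S") + "-" + basename
print(layer_id)  # e.g. 20240315104500-roads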

Prerequisites

While running the Wizard, you will be asked to provide the following information. These values are needed in order to properly configure and deploy the pipeline. You should have this information available prior to executing the Wizard.

  • Group you would like the pipeline to be shared with. Make sure that all input and output catalogs are also shared with this Group.

  • Pipeline prefix - your deployed pipeline name will begin with whatever string you enter for prefix

  • Expected number of requests per hour

  • Average size of input shapefiles

  • Input stream catalog and layer

  • Handling of requests already queued on stream

Execution

In order to deploy and run this pipeline template, you will need the Wizard Deployer. The Wizard executes interactively, asking questions about the application and expecting you to provide the needed answers. Assuming you have followed the Wizard's documentation and set up the needed parameters beforehand, follow these steps:

  1. Execute the script as ./wizard.sh
  2. Follow the prompts and provide the needed answers

NOTE:

  • One of the important things to consider before answering the Wizard questions is the configuration of the input Stream layer. We recommend setting the outbound throughput to at least the expected number of consumers (users and pipelines) times the inbound throughput; for example, with an inbound throughput of 2 MB/s and three consumers, configure an outbound throughput of at least 6 MB/s. The outbound rate may need to be higher still if some consumers "replay" recent data. The inbound throughput must not exceed the outbound throughput; if it does, consumers cannot read all the data that the producer provides.

You can use an existing stream catalog/layer or let the Wizard create a new one for you. If using an existing catalog, make sure it is shared with the GROUP_ID that will be used for the deployment of this pipeline template.

Set Up Input Stream Layer

A dedicated stream layer is needed before deploying your pipeline template. See the Data User Guide for details on creating and configuring catalogs and layers. Make sure to share your catalog with the same Group you plan to share the conversion pipeline with. As discussed earlier, conversion requests are in JSON format, so when creating the stream layer, be sure to set the CONTENT TYPE to application/json.

Shapefile Input Storage

Before a shapefile can be processed, it must reside in either a Versioned or Volatile layer on the HERE platform and be accessible to the conversion pipeline, that is, shared with the same Group. Since shapefiles are .zip archives, make sure the layer you specify has been configured with a CONTENT TYPE of application/zip. This applies whether you are specifying the catalog/layer directly in a request.json file or using the convenience script found in the utils folder of this distribution.

You are now ready to deploy your pipeline using the HERE platform Wizard Deployer. Make sure that you have followed the Wizard installation and configuration instructions.

GeoJSON Output Catalogs

This pipeline template does not require a dedicated output catalog, since each request specifies where the GeoJSON output for that request should be published. The key point to remember is that all output catalogs specified in your requests must be shared with the same Group as the conversion pipeline. Each request will create a new Versioned layer in the named output catalog.

Verification

In the Platform Portal, select the Pipelines tab, where you should see your pipeline deployed and running. The Flink Dashboard provides important details about your running pipeline, letting you monitor metrics such as the number of messages and the amount of data processed.

Once your request has been successfully processed, you can go to your output catalog to visualize the resulting GeoJSON in the Portal.

The primary factor determining how long a request takes to process is the time needed to publish the output GeoJSON to the new Versioned layer. Publishing a single partition takes approximately 10 seconds, so if you choose to specify an output tile level, pick one appropriate for the geographic extent of your shapefile. Choosing an appropriate tile level also matters for optimizing later access to (and visualization of) your GeoJSON data: partitions containing a large number of features take longer to access, and in extreme cases the amount of data may exceed the limits of the Platform Portal renderer. See the HERE Tile partitioning documentation for an overview of partition sizes.
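
As a rough planning aid based on the ~10 seconds per partition figure above (the partition count is whatever your data and chosen tile level produce, and the linear model is an assumption; actual times vary with load):

# Back-of-the-envelope publish-time estimate.
SECONDS_PER_PARTITION = 10  # approximate figure quoted above

def estimated_publish_seconds(partition_count):
    return partition_count * SECONDS_PER_PARTITION

print(estimated_publish_seconds(250))  # 2500 s, i.e. roughly 42 minutes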

Limitations and Workarounds

Below are the operational limitations that you should be aware of to ensure successful processing of your conversion requests.

Visualizing GeoJSON output

  • When a single partition contains more than 10 MB of data, it can be slow to load, and over 100 MB it might not be feasible to load at all, depending on your browser capabilities. If you want to visualize the GeoJSON output from within the Portal, it is recommended to specify a higher tile level (smaller partition size) in the tileLevel option of the request.

Cost Estimation

Executing this pipeline template will incur the following costs:

Storage-Blob

Cost will depend on the amount of data being stored in a Versioned layer.

Metadata

Cost will depend on the amount and size of partitions (metadata) stored in the Versioned layer.

Storage-Stream and Stream TTL

Cost will depend on the configuration of the input Stream layer selected by the user, such as Throughput IN, Throughput OUT, and TTL.

Data Transfer IO

Cost will depend on the amount of:

  • data read from the input Stream layer
  • output data written to the Versioned layer

Compute Core and Compute RAM

Cost will depend on the frequency configuration selected by the user. If high frequency is needed, more workers will be used for deployment.

Log Search IO

Cost will depend on the log level set for the execution of this pipeline template. To minimize this cost, the user can set the log level to WARN.

Support

If you need support with this pipeline template, please contact us.
