Process CSV Address Information with Golang and the HERE Geocoder API

By Nic Raboy | 01 November 2018

Have you ever been given a comma-separated values (CSV) file filled with information about people where the data is only partially complete and you’re the one tasked with filling in the blanks? What if, specifically, the address information is incomplete? Do you gather the missing information by hand, which could take a long time, or do you try to automate the process?

Not too long ago, I wrote about a similar scenario I faced in a tutorial titled, Simple Data Processing with JavaScript and the HERE API. However, in that experience I demonstrated using Node.js.

In this tutorial we’re going to see how to parse a CSV file using the Go programming language (Golang) and clean up the address information using the HERE Geocoder API. To make things more interesting we’re going to follow best practices for concurrency to give the application the best possible performance.

The anticipated results should look something like the following image:

[Image: golang-geocode-csv]

The above image doesn’t explain everything. For example, we have two lines in our CSV file and the location data for those lines is very vague. Think “Tracy, CA” and “Berlin”. When running our application, more sense is made of that location data.

Mapping REST Responses and the General Application Data Model

Before applying any application logic, we need to add our Golang boilerplate code and define the data models we wish to use. The most obvious of the data models is that of the actual HERE API response.

Create a main.go file within the $GOPATH and include the following:

package main

import (
    "bufio"
    "encoding/csv"
    "encoding/json"
    "flag"
    "fmt"
    "io"
    "io/ioutil"
    "net/http"
    "net/url"
    "os"
    "sync"
)

type Address struct {
    Label       string `json:"Label,omitempty"`
    Country     string `json:"Country,omitempty"`
    State       string `json:"State,omitempty"`
    County      string `json:"County,omitempty"`
    City        string `json:"City,omitempty"`
    District    string `json:"District,omitempty"`
    Street      string `json:"Street,omitempty"`
    HouseNumber string `json:"HouseNumber,omitempty"`
    PostalCode  string `json:"PostalCode,omitempty"`
}

type GeoData struct {
    Response struct {
        View []struct {
            Type   string `json:"_type"`
            Result []struct {
                Location struct {
                    Address         Address `json:"Address"`
                    DisplayPosition struct {
                        Latitude  float64 `json:"Latitude"`
                        Longitude float64 `json:"Longitude"`
                    } `json:"DisplayPosition"`
                } `json:"Location"`
            } `json:"Result"`
        } `json:"View"`
    } `json:"Response"`
}

type LeadData struct {
    Firstname string  `json:"first_name"`
    Lastname  string  `json:"last_name"`
    Company   string  `json:"company"`
    Email     string  `json:"email"`
    Location  string  `json:"location,omitempty"`
    Address   Address `json:"address,omitempty"`
}

type Geocoder struct {
    AppId   string `json:"app_id"`
    AppCode string `json:"app_code"`
}

type ParsedLeadData struct {
    mux  sync.Mutex
    data []LeadData
}

var waitGroup sync.WaitGroup
var geocoder Geocoder
var leadData chan LeadData
var parsedLeadData ParsedLeadData

func main() {}

The code above looks worse than it actually is. The Address data structure will hold all our information about an address. The JSON annotations map each field to the property name used in the HERE API response, and they also control the property names when we print our results. The omitempty option is there so that empty string values are left out when we print our data.
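To make the effect of omitempty concrete, here is a minimal sketch that marshals a partially filled address. It uses a trimmed copy of the Address type, and the encodeAddress helper is a name made up for this illustration rather than part of the tutorial’s code.

```go
package main

import (
    "encoding/json"
    "fmt"
)

// Address is a trimmed copy of the tutorial's Address type.
type Address struct {
    Label      string `json:"Label,omitempty"`
    City       string `json:"City,omitempty"`
    State      string `json:"State,omitempty"`
    PostalCode string `json:"PostalCode,omitempty"`
}

// encodeAddress is a hypothetical helper that marshals an Address
// and returns the JSON as a string.
func encodeAddress(a Address) (string, error) {
    b, err := json.Marshal(a)
    if err != nil {
        return "", err
    }
    return string(b), nil
}

func main() {
    // Label and PostalCode are empty strings, so omitempty drops them.
    out, err := encodeAddress(Address{City: "Tracy", State: "CA"})
    if err != nil {
        fmt.Println("marshal failed:", err)
        return
    }
    fmt.Println(out) // {"City":"Tracy","State":"CA"}
}
```

One thing to keep in mind is that omitempty has no effect on struct-typed fields, so the Address field inside LeadData is still printed as an empty object when nothing was geocoded.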

The GeoData data structure is the actual response from the HERE API. The GeoData data structure uses the previously defined Address data structure and is actually far from complete. Looking at the documentation, there are other properties returned, but the properties we modeled are all that we really care about.
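As a rough illustration of how a partial model still decodes a larger payload, the sketch below unmarshals a hand-written JSON document into a trimmed version of GeoData. The sample payload, its values, and the firstLabel helper are assumptions made for this example; a real API response contains more properties than shown here.

```go
package main

import (
    "encoding/json"
    "errors"
    "fmt"
)

type Address struct {
    Label string `json:"Label,omitempty"`
    City  string `json:"City,omitempty"`
}

// GeoData is a trimmed version of the tutorial's response model.
type GeoData struct {
    Response struct {
        View []struct {
            Result []struct {
                Location struct {
                    Address Address `json:"Address"`
                } `json:"Location"`
            } `json:"Result"`
        } `json:"View"`
    } `json:"Response"`
}

// sampleResponse is a hypothetical, hand-written payload. The "Unmodeled"
// property stands in for everything the struct does not describe.
const sampleResponse = `{
  "Response": {
    "Unmodeled": "ignored by the decoder",
    "View": [{
      "Result": [{
        "Location": {
          "Address": {"Label": "Tracy, CA, United States", "City": "Tracy"}
        }
      }]
    }]
  }
}`

// firstLabel is a hypothetical helper that decodes raw JSON and returns
// the label of the first result, or an error if there is none.
func firstLabel(raw string) (string, error) {
    var geo GeoData
    if err := json.Unmarshal([]byte(raw), &geo); err != nil {
        return "", err
    }
    if len(geo.Response.View) == 0 || len(geo.Response.View[0].Result) == 0 {
        return "", errors.New("no results in response")
    }
    return geo.Response.View[0].Result[0].Location.Address.Label, nil
}

func main() {
    label, err := firstLabel(sampleResponse)
    if err != nil {
        fmt.Println("decode failed:", err)
        return
    }
    fmt.Println(label) // Tracy, CA, United States
}
```

Properties that are missing from the struct are skipped by encoding/json, which is why the model only needs the parts we care about.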

The LeadData data structure represents the model of our CSV data. It is simple, but specific. The CSV data will have information about a person as well as a generic location that could be as complete or as incomplete as the user wants to make of it. The Address data structure is present because that is where we’ll store the geocoded data.

The Geocoder is where we’ll store the app id and app code values obtained from a free HERE developer account. Both values are required for every request to the API.

The ParsedLeadData data structure may be the most confusing if you’re a novice user of Golang. Because we plan to use concurrent goroutines in our application, several workers may try to store results at the same time. Appending to the same slice from multiple goroutines without synchronization is a data race in Go, so we pair the slice with a Mutex. A worker locks the Mutex before appending and unlocks it afterwards, which keeps other goroutines from writing to the slice at the same time.
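The locking pattern can be sketched on its own, outside of the geocoding application. In the example below, SafeList, Add, Len, and collect are hypothetical names chosen for the illustration, with SafeList playing the role of ParsedLeadData.

```go
package main

import (
    "fmt"
    "sync"
)

// SafeList is a hypothetical stand-in for ParsedLeadData: a slice
// paired with the Mutex that guards it.
type SafeList struct {
    mux  sync.Mutex
    data []string
}

// Add appends a value while holding the lock.
func (s *SafeList) Add(v string) {
    s.mux.Lock()
    defer s.mux.Unlock()
    s.data = append(s.data, v)
}

// Len reports the number of stored values while holding the lock.
func (s *SafeList) Len() int {
    s.mux.Lock()
    defer s.mux.Unlock()
    return len(s.data)
}

// collect starts n goroutines that each add one value, waits for all
// of them, and returns how many values were stored.
func collect(n int) int {
    var list SafeList
    var wg sync.WaitGroup
    for i := 0; i < n; i++ {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            list.Add(fmt.Sprintf("lead-%d", id))
        }(i)
    }
    wg.Wait()
    return list.Len()
}

func main() {
    fmt.Println(collect(100)) // 100
}
```

Without the lock, some appends could be lost or the program could be flagged by the race detector when run with go run -race.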

var waitGroup sync.WaitGroup
var geocoder Geocoder
var leadData chan LeadData
var parsedLeadData ParsedLeadData

The above variables will be used throughout the application. The waitGroup variable will keep track of our worker goroutines, the geocoder will store our API credentials, the leadData will contain each line of our CSV file, and the parsedLeadData will hold our finalized results.

Using the HERE Geocoder API via HTTP Requests

It may be difficult to believe, but the easiest part of this application is the geocoder functionality. That is because we are only making an HTTP request to the HERE API.

Within your project’s main.go file, include the following function:

func (geocoder *Geocoder) geocode(query string) (GeoData, error) {
    endpoint, _ := url.Parse("https://geocoder.api.here.com/6.2/geocode.json")
    queryParams := endpoint.Query()
    queryParams.Set("app_id", geocoder.AppId)
    queryParams.Set("app_code", geocoder.AppCode)
    queryParams.Set("searchtext", query)
    endpoint.RawQuery = queryParams.Encode()
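    response, err := http.Get(endpoint.String())
    if err != nil {
        return GeoData{}, err
    }
    // Close the body so the underlying connection can be released.
    defer response.Body.Close()
    data, err := ioutil.ReadAll(response.Body)
    if err != nil {
        return GeoData{}, err
    }
    var geoData GeoData
    // Report malformed responses instead of silently returning empty data.
    if err := json.Unmarshal(data, &geoData); err != nil {
        return GeoData{}, err
    }
    return geoData, nil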
}

In the above function, we are taking the API credentials as well as a query string with a potentially complex address. Using that information we can construct an HTTP request. After executing the HTTP request, we store the response in a GeoData variable and return it to where this function was called. As long as our GeoData is modeled how we want it, gaining access to this information is easy because the API does all the heavy lifting.
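If you are curious what the constructed request looks like, the following sketch isolates the URL building step so the result can be printed without calling the API. The buildGeocodeURL helper and the ID and CODE placeholder credentials are assumptions made for this illustration.

```go
package main

import (
    "fmt"
    "net/url"
)

// buildGeocodeURL is a hypothetical helper that assembles the request
// URL the same way the geocode function does.
func buildGeocodeURL(appID, appCode, query string) (string, error) {
    endpoint, err := url.Parse("https://geocoder.api.here.com/6.2/geocode.json")
    if err != nil {
        return "", err
    }
    params := endpoint.Query()
    params.Set("app_id", appID)
    params.Set("app_code", appCode)
    params.Set("searchtext", query)
    // Encode escapes the values and sorts the parameters by key.
    endpoint.RawQuery = params.Encode()
    return endpoint.String(), nil
}

func main() {
    // "ID" and "CODE" are placeholders, not real credentials.
    u, err := buildGeocodeURL("ID", "CODE", "Tracy, CA")
    if err != nil {
        fmt.Println("could not build URL:", err)
        return
    }
    fmt.Println(u)
    // https://geocoder.api.here.com/6.2/geocode.json?app_code=CODE&app_id=ID&searchtext=Tracy%2C+CA
}
```

Letting url.Values handle the escaping means a query with commas, spaces, or other special characters cannot break the request.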

Loading Data from a CSV File

With the model and API functionality in place, we need to be able to collect data from a CSV file. Golang does have some nice CSV functionality baked in to make our lives easier.

Open the project’s main.go file and include the following:

func LoadCSV(filepath string) error {
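    csvFile, err := os.Open(filepath)
    if err != nil {
        return err
    }
    // Release the file handle once the whole file has been read.
    defer csvFile.Close()
    reader := csv.NewReader(bufio.NewReader(csvFile))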
    for {
        line, err := reader.Read()
        if err == io.EOF {
            break
        } else if err != nil {
            return err
        }
        leadData <- LeadData{
            Firstname: line[0],
            Lastname:  line[1],
            Company:   line[2],
            Email:     line[3],
            Location:  line[4],
        }
    }
    return nil
}

In the above function, we are expecting a file path. With the file path, we open the file and start reading it as a CSV file. For each line of the CSV file, until we hit the end of the file, we convert the columns to a LeadData object and send it on our channel to be picked up by our worker goroutines. The channel hands out values in the order they were sent, but because several workers process them at the same time, the finished results are not guaranteed to come out in the original order.
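The parsing step can be tried without a file or a channel. The sketch below reads made-up sample rows from a string and collects them in a slice. The Lead type, the parseLeads and parseLeadsFromString helpers, and the sample rows are all assumptions for this illustration. It also sets FieldsPerRecord, which is one way to reject rows that do not have exactly five columns instead of indexing past the end of a short row.

```go
package main

import (
    "encoding/csv"
    "fmt"
    "io"
    "strings"
)

// Lead is a hypothetical, trimmed version of LeadData.
type Lead struct {
    Firstname, Lastname, Company, Email, Location string
}

// parseLeads reads every record from r and converts it to a Lead.
func parseLeads(r io.Reader) ([]Lead, error) {
    reader := csv.NewReader(r)
    reader.FieldsPerRecord = 5 // any other column count becomes an error
    var leads []Lead
    for {
        line, err := reader.Read()
        if err == io.EOF {
            return leads, nil
        }
        if err != nil {
            return nil, err
        }
        leads = append(leads, Lead{
            Firstname: line[0],
            Lastname:  line[1],
            Company:   line[2],
            Email:     line[3],
            Location:  line[4],
        })
    }
}

// parseLeadsFromString is a convenience wrapper for in-memory data.
func parseLeadsFromString(s string) ([]Lead, error) {
    return parseLeads(strings.NewReader(s))
}

func main() {
    // Made-up sample rows; the quoted field keeps its comma.
    sample := "Jane,Doe,Example Inc,jane@example.com,\"Tracy, CA\"\n" +
        "John,Smith,Example GmbH,john@example.com,Berlin\n"
    leads, err := parseLeadsFromString(sample)
    if err != nil {
        fmt.Println("parse failed:", err)
        return
    }
    for _, lead := range leads {
        fmt.Println(lead.Firstname, "->", lead.Location)
    }
}
```

Accepting an io.Reader rather than a file path is a design choice that makes the parsing logic easy to exercise with in-memory data.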

You can learn more about reading CSV data with Golang in a previous tutorial I wrote titled, Parse CSV Data using the Go Programming Language.

Designing Workers for Concurrent Go Development

With the CSV function reading lines of the CSV file into our channel, we need to determine what each of our workers will do with that data. Take the following worker function for example:

func worker() {
    defer waitGroup.Done()
    for {
        lead, ok := <-leadData
        if !ok {
            break
        }
        address, _ := geocoder.geocode(lead.Location)
        if len(address.Response.View) > 0 && len(address.Response.View[0].Result) > 0 {
            lead.Address = address.Response.View[0].Result[0].Location.Address
            lead.Location = ""
        }
        parsedLeadData.mux.Lock()
        parsedLeadData.data = append(parsedLeadData.data, lead)
        parsedLeadData.mux.Unlock()
    }
}

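When the worker is started, it will remain running until the channel is closed. Once the channel has been closed and every remaining value has been received, the receive operation reports ok as false, the loop is broken, and the deferred waitGroup.Done call alerts the waitGroup. You’ll see the importance of the waitGroup soon.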
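The worker’s exit condition relies on how receives behave once a channel has been closed. The sketch below shows that behavior in isolation, using a buffered channel so that it runs in a single goroutine. The drain and receiveAfterClose functions are hypothetical names for this illustration.

```go
package main

import "fmt"

// drain sends the values into a buffered channel, closes it, and then
// receives until the channel reports that it is closed and empty.
func drain(values []int) []int {
    ch := make(chan int, len(values))
    for _, v := range values {
        ch <- v
    }
    close(ch) // no more sends; values already buffered stay readable

    var got []int
    for {
        v, ok := <-ch
        if !ok {
            break // closed and fully drained
        }
        got = append(got, v)
    }
    return got
}

// receiveAfterClose shows what a receive on a closed, empty channel
// returns: the zero value and false.
func receiveAfterClose() (int, bool) {
    ch := make(chan int)
    close(ch)
    v, ok := <-ch
    return v, ok
}

func main() {
    fmt.Println(drain([]int{1, 2, 3})) // [1 2 3]
    fmt.Println(receiveAfterClose())   // 0 false
}
```

A receive on a channel that is open but has nothing to read simply blocks, which is why the workers sit idle until data arrives instead of exiting. The same loop can also be written as for lead := range leadData, which stops on its own when the channel is closed.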

For each lead received from the channel, the generic location string from the CSV file is passed to our geocode function and the API response is obtained. If the API response has address data, the first result is added to our lead and the location is cleared out. To write our parsed data, we first lock the Mutex, append the parsed data to the slice, then unlock the Mutex so the next goroutine can access it.

Parsing Address Information in Parallel with Goroutines

With all our functions in place, we really just need to bring it all together. When the application is launched, we need to collect user input and determine how to set up our workers. Take the following main function:

func main() {
    filepath := flag.String("p", "data.csv", "Path to CSV Data")
    workers := flag.Int("w", 5, "Worker Thread Count")
    flag.Parse()
    leadData = make(chan LeadData)
    geocoder = Geocoder{AppId: "APP-ID-HERE", AppCode: "APP-CODE-HERE"}
    for i := 0; i < *workers; i++ {
        waitGroup.Add(1)
        go worker()
    }
    LoadCSV(*filepath)
    close(leadData)
    waitGroup.Wait()
    output, _ := json.Marshal(parsedLeadData.data)
    fmt.Println(string(output))
}

We are accepting two user input flags and defaulting them if they are not present. These flags determine the file path and how many workers to use. The size of your CSV file and the power of your computer should determine how many workers you use. I probably wouldn’t recommend more than 20.

After initializing our variables, we start our goroutines. Notice how we are adding one to the waitGroup for each worker. The waitGroup.Wait call further down blocks until that counter is back at zero. We’re doing this so our application doesn’t exit before our goroutines complete. Each time a worker ends, its deferred waitGroup.Done call subtracts one from the counter.

With the workers waiting for data, we call the LoadCSV function and populate the channel that the workers use. Once the whole file has been sent, we close the channel, which lets each worker finish its loop. Now we actually wait until the workers are done. When the workers are done, we convert our parsed data into JSON and print it on the screen.
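To see the whole flow without needing API credentials, here is a compact sketch of the same start, feed, close, and wait sequence. The fakeGeocode function is a made-up stand-in for the HTTP request, and process is a hypothetical name. The results are sorted at the end only because the completion order of the workers is not deterministic.

```go
package main

import (
    "fmt"
    "sort"
    "strings"
    "sync"
)

// fakeGeocode is a hypothetical stand-in for the real API call.
func fakeGeocode(location string) string {
    return strings.ToUpper(location)
}

// process fans the locations out to a fixed number of workers and
// gathers the results in a Mutex-guarded slice.
func process(locations []string, workers int) []string {
    jobs := make(chan string)
    var wg sync.WaitGroup
    var mux sync.Mutex
    var results []string

    // Start the workers before any data is sent.
    for i := 0; i < workers; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for loc := range jobs { // ends when jobs is closed
                resolved := fakeGeocode(loc)
                mux.Lock()
                results = append(results, resolved)
                mux.Unlock()
            }
        }()
    }

    // Feed the channel, then close it so the workers can finish.
    for _, loc := range locations {
        jobs <- loc
    }
    close(jobs)
    wg.Wait()

    sort.Strings(results) // make the output order predictable
    return results
}

func main() {
    fmt.Println(process([]string{"tracy, ca", "berlin"}, 3))
    // [BERLIN TRACY, CA]
}
```

The order of operations matters here: the workers must be running before the unbuffered channel is fed, and the channel must be closed before waiting, otherwise the program would block forever.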

Conclusion

You just saw how to parse potentially massive amounts of CSV location data using Golang and the HERE Geocoder API. In this example, we read from a CSV and made requests for location data in parallel, depending on how many worker threads were available. This tutorial was an alternative to a previous tutorial that I had written in Node.js.

There are a few things to note that I hadn’t covered in this tutorial. The HERE Geocoder API could return more than one result for a given query string. The vaguer the query, the more results you are likely to get, and this tutorial only uses the first one. I also ignored many of the errors that could have been produced throughout the tutorial. In a production application, don’t ignore them with an underscore.
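As a starting point for handling more than one result, the sketch below walks every view and result in a decoded response and collects the address labels, while returning the decoding error instead of discarding it. The candidateLabels helper and the sample payload are assumptions for this illustration, not actual API output.

```go
package main

import (
    "encoding/json"
    "fmt"
)

// geoData is a trimmed version of the tutorial's GeoData model.
type geoData struct {
    Response struct {
        View []struct {
            Result []struct {
                Location struct {
                    Address struct {
                        Label string `json:"Label"`
                    } `json:"Address"`
                } `json:"Location"`
            } `json:"Result"`
        } `json:"View"`
    } `json:"Response"`
}

// candidateLabels is a hypothetical helper that returns the label of
// every result in the response rather than only the first one.
func candidateLabels(raw string) ([]string, error) {
    var geo geoData
    if err := json.Unmarshal([]byte(raw), &geo); err != nil {
        return nil, fmt.Errorf("decoding response: %w", err)
    }
    var labels []string
    for _, view := range geo.Response.View {
        for _, result := range view.Result {
            labels = append(labels, result.Location.Address.Label)
        }
    }
    return labels, nil
}

func main() {
    // Hand-written sample with two candidates for a vague query.
    sample := `{"Response":{"View":[{"Result":[
        {"Location":{"Address":{"Label":"Berlin, Germany"}}},
        {"Location":{"Address":{"Label":"Berlin, NH, United States"}}}
    ]}]}}`
    labels, err := candidateLabels(sample)
    if err != nil {
        fmt.Println("lookup failed:", err)
        return
    }
    for _, label := range labels {
        fmt.Println(label)
    }
}
```

With the full list available, an application could ask the user to choose a candidate or flag the row for manual review when there is more than one.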

If you’d like to see how to reverse geocode data, as in take latitude and longitude coordinates and convert them into addresses using Golang, check out a tutorial I wrote titled, Reverse Geocoding Coordinates to Addresses with the Go Programming Language.