Wednesday, 11 June 2014

Extract Coordinates and Other Data from KML in R

KML files are used to visualize geographical data in Google Earth. These files are written in XML and make it possible to display places and to attach additional data to them in HTML format.

These days I am working with the MIDAS database of wind measuring stations across the world, which can be freely downloaded here:

First of all, the file comes in KMZ format, which is simply a compressed KML. In order to use it you need to extract its contents; I used 7-Zip for this purpose.
The file has numerous entries, one for each point on the map. Each entry generally looks like the one below:

    <Placemark>  
       <visibility>0</visibility>  
       <Snippet>ABERDEEN: GORDON BARRACKS</Snippet>  
       <description><![CDATA[<table>  
       <tr><td><b>src_id:</b><td>14929  
       <tr><td><b>Name:</b><td>ABERDEEN: GORDON BARRACKS  
       <tr><td><b>Start date:</b><td>01-01-1956  
       <tr><td><b>End date:</b><td>31-12-1960  
       <tr><td><b>Postcode:</b><td>AB23 8  
       <center><a href="">Station details</a></center>  
       </table>]]></description>  
       <styleUrl>#closed</styleUrl>  
       <Point>  
          <coordinates>-2.08602,57.1792,23</coordinates>  
       </Point>  
    </Placemark>  

This chunk of XML code is used to show one point on Google Earth. The coordinates and the elevation of the point are enclosed in the <coordinates> tag (note that KML lists longitude first, then latitude, then elevation). The <styleUrl> tag tells Google Earth to visualize this point with the style declared earlier in the KML file, which in this case is a red circle because the station is no longer recording.
If someone clicks on this point, the HTML information wrapped in the CDATA section is shown. The user then has access to the source ID of the station, its name, location, start date, end date, postcode, and a link to view more information about it.

In this work I am interested in extracting the coordinates of each point, plus its ID and the name of the station. I need to do this because I then have to correlate the ID in this file with the ID written in the text file containing the wind measurements, which has just the ID without coordinates.
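That correlation step is just a join on the shared ID column. A minimal sketch with `merge()` is below; only the src_id 14929 and its coordinates come from the file shown above, everything else (the second station, the wind_speed values, the column names) is invented for illustration:

```r
# Hypothetical station table, as it would be extracted from the KML
stations <- data.frame(ID  = c(14929, 99999),
                       LON = c(-2.08602, 0.5),
                       LAT = c(57.1792, 51.0))

# Hypothetical wind measurements, as they would come from the txt file
measures <- data.frame(ID = c(14929, 14929, 99999),
                       wind_speed = c(4.2, 5.1, 3.7))

# Attach coordinates to every measurement via the shared ID
merged <- merge(measures, stations, by = "ID")
```

After the merge, every wind measurement carries the coordinates of its station.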

In maptools there is a function to extract coordinates and elevation, called getKMLcoordinates.
My problem was that I also needed the other information I mentioned above, so I decided to tweak the source code of this function a bit to solve my problem.

#Extracting Coordinates and ID from KML  
kml.text <- readLines("midas_stations.kml")  

re <- "<coordinates> *([^<]+?) *<\\/coordinates>"  
coords <- grep(re,kml.text)  

re2 <- "src_id:"  
SRC.ID <- grep(re2,kml.text)  

re3 <- "<tr><td><b>Name:</b><td>"  
Name <- grep(re3,kml.text)  

kml.coordinates <- matrix(0,length(coords),4,dimnames=list(c(),c("ID","LON","LAT","ELEV")))  
kml.names <- matrix(0,length(coords),1)  

for(i in 1:length(coords)){  
    sub.coords <- coords[i]  
    temp1 <- gsub("<coordinates>"," ",kml.text[sub.coords])  
    temp2 <- gsub("</coordinates>"," ",temp1)  
    coordinates <- as.numeric(unlist(strsplit(temp2,",")))  

    sub.ID <- SRC.ID[i]  
    ID <- as.numeric(gsub("<tr><td><b>src_id:</b><td>"," ",kml.text[sub.ID]))  

    sub.Name <- Name[i]  
    NAME <- gsub("<tr><td><b>Name:</b><td>","",kml.text[sub.Name])  

    kml.coordinates[i,] <- matrix(c(ID,coordinates),ncol=4)  
    kml.names[i,] <- matrix(c(NAME),ncol=1)  
}

The first thing I had to do was import the KML in R. The function readLines imports the KML file and stores it as a large character vector, with one element for each line of text.
For example, if we look at the KML code shown above, the vector will look like this:

 kml.text <- c("<Placemark>", "<visibility>0</visibility>",   
 "<Snippet>ABERDEEN: GORDON BARRACKS</Snippet>", ...  

So if I want to access the tag <Placemark>, I need to subset the first element of the vector:
kml.text[1]

The function grep allows me to locate the elements of the vector (and therefore the rows of the KML) where a certain word is present.
I can create the object re and use grep to locate the lines where the tag <coordinates> is written. This method was taken from the function getKMLcoordinates.
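As a quick illustration of that step, here is the same grep call run on a toy three-line vector (the coordinate values are taken from the placemark shown earlier):

```r
# Toy stand-in for the vector returned by readLines()
kml.text <- c("<Placemark>",
              "<coordinates>-2.08602,57.1792,23</coordinates>",
              "</Placemark>")

# Same regular expression used in the main script
re <- "<coordinates> *([^<]+?) *<\\/coordinates>"

# grep returns the indices of the matching lines
coords <- grep(re, kml.text)
coords  # 2: the <coordinates> tag sits on the second line
```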

By using other key words I can locate the lines of the KML that contain the ID and the name of the station.

Then I can just run a loop for each element in the coords vector and collect the results into a matrix with ID and coordinates.
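The body of that loop boils down to the following transformation, shown here on a single line taken from the example placemark:

```r
# One line of the KML holding the coordinates
line <- "<coordinates>-2.08602,57.1792,23</coordinates>"

# Strip the opening and closing tags, as in the loop above
temp1 <- gsub("<coordinates>", " ", line)
temp2 <- gsub("</coordinates>", " ", temp1)

# Split on commas and convert to numeric: longitude, latitude, elevation
coordinates <- as.numeric(unlist(strsplit(temp2, ",")))
coordinates  # -2.08602 57.17920 23.00000
```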

I am sure that this is a rudimentary effort and that there are other, more elegant ways of doing it, but this was quick and easy to implement and it does the job perfectly.

In this work I am interested only in stations that are still collecting data, so I had to manually filter the file by deleting all the <Placemark> chunks for non-working stations (such as the one shown above).
It would be nice to find an easy way of filtering a file like this, ignoring the whole <Placemark> chunk whenever R finds the line <styleUrl>#closed</styleUrl>.
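One possible line-based sketch of such a filter, staying with the readLines approach used above (the toy vector here stands in for the real file, and this is only an idea, not a tested solution for the full MIDAS file):

```r
# Toy stand-in for readLines("midas_stations.kml"): one closed and one open station
kml.text <- c("<Placemark>",
              "  <styleUrl>#closed</styleUrl>",
              "</Placemark>",
              "<Placemark>",
              "  <styleUrl>#open</styleUrl>",
              "</Placemark>")

# Locate the start and end line of every <Placemark> block
starts <- grep("<Placemark>",  kml.text, fixed = TRUE)
ends   <- grep("</Placemark>", kml.text, fixed = TRUE)

# Flag the blocks that contain the #closed style
closed <- vapply(seq_along(starts), function(i)
    any(grepl("#closed", kml.text[starts[i]:ends[i]], fixed = TRUE)),
    logical(1))

# Drop every line belonging to a closed block
drop <- unlist(Map(seq, starts[closed], ends[closed]))
kml.open <- if (length(drop)) kml.text[-drop] else kml.text
```

This assumes each <Placemark> and </Placemark> tag sits on its own line, which is how the MIDAS file is laid out.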

Any suggestions?


  1. You can work directly with the XML:

    library(XML)
    doc <- xmlParse("midas_stations.kml")
    readHTMLTable(htmlParse(xpathSApply(doc, "//description", xmlValue)))[[1]]

    > readHTMLTable(htmlParse(xpathSApply(doc, "//description", xmlValue)))[[1]]
    1     src_id: 14929
    2       Name: ABERDEEN: GORDON BARRACKS
    3 Start date: 01-01-1956
    4   End date: 31-12-1960
    5   Postcode: AB23 8

    coordinates <- xpathSApply(doc, "//coordinates",xmlValue)

    > coordinates
    [1] "-2.08602,57.1792,23"

    > xpathSApply(doc, "//styleUrl", xmlValue)
    [1] "#closed"


  3. That works, but it's slow. Using readHTMLTable is also slow. Try using the rgdal package from CRAN:

    library(rgdal)
    library(XML)
    library(stringr)

    kmlfile <- "midas_stations_by_area.kml"

    #This gives you the layer names in the file. Your file has lots of layers, so I'll just pick one as an example.
    ogrListLayers(kmlfile)

    #Here we read in the CANADA layer from the KML file. This gives you a SpatialPointsDataFrame.
    midascanada <- readOGR(kmlfile, "CANADA")

    #The meta-data is still stuck in a blob of HTML/XML called Description. Here's how we pull that out with the XML CRAN package.
    #The xpath expression is important here. We're only grabbing the 2nd td field of each tr element because the first one is the data label. We're assuming you have the same number of fields for each point. Turns out to be true this time, but might not always be. You have been warned.

    midasmeta <- matrix(xpathSApply(htmlParse(as.character(midascanada@data$Description)), "//table//tr/td[2]", xmlValue), nrow = nrow(midascanada@data), byrow = TRUE)

    #Let's grab the field names. They're the same for each point, so let's save some time and only grab the first one.
    midasmetanames <- xpathSApply(htmlParse(as.character(midascanada@data$Description[1])), "//table//tr/td[1]", xmlValue)

    #This is just a bit of cleaning with the stringr package to get rid of the colon at the ends.
    midasmetanames <- str_sub(midasmetanames, end = -2L)

    #This puts the new columns onto the original dataframe
    midascanada@data <- cbind(midascanada@data, midasmeta)

    #This gives them names
    names(midascanada@data)[3:7] <- midasmetanames

    Voila. You can now coerce into anything you like. Feel free to drop the Description column. It's a huge space hog.

  4. Forgot to answer your actual question:

    From the midascanada SpatialPointsDataFrame above, you can get only the open stations with the following line (note the space after the word Current; this was in the data, and could have been cleaned in the xpathSApply line above with the option trim = TRUE):

    midasopen <- subset(midascanada, midascanada$'End date' == "Current ")

