- `sns.boxplot()` – generic boxplot
- `sns.distplot()` – histogram and kernel density estimate (KDE) plotted together
- `sns.distplot(rug=True)` – adds a rugplot
- `sns.kdeplot()` – kernel density estimate plot
- `sns.kdeplot(n_levels)` – set the `n_levels` parameter high to make the KDE finer
- `sns.rugplot()` – rugplot
- `sns.jointplot()` – show a scatterplot and marginal histograms for two-dimensional data
- `sns.jointplot(kind='hex')` – hexbin plot, like a two-dimensional histogram
- `sns.jointplot(kind='kde')` – two-dimensional KDE (might take a while to plot for large datasets)
- `sns.jointplot(kind='reg')` – scatterplot, regression line and confidence interval

The `sns.jointplot()` function returns a JointGrid object, which you can exploit by saving the result and then adding to it whatever you feel like. Some examples:

```python
# Save the JointGrid
g = sns.jointplot(x="x", y="y", data=df, kind="kde", color="m")
# Use plot_joint to add a scatter plot overlay
g.plot_joint(plt.scatter, c='w', s=1)
# Or a regression line:
g.plot_joint(sns.regplot)
```

- `sns.pairplot()` – used for exploring the relationships between variables in a data frame. By default, plots a scatterplot matrix on the off-diagonals and histograms on the diagonals. Similar to the R function `ggpairs()` in the GGally package.

Similar to how `jointplot()` returns a JointGrid, `pairplot()` returns a PairGrid with its own set of methods available to it. You can use this to change what graphs are plotted:

```python
# Store the PairGrid object
g = sns.PairGrid(iris)
# Change the plots down the diagonal
g.map_diag(sns.kdeplot)
# Change the plots on the off-diagonals
g.map_offdiag(sns.kdeplot, cmap="Blues_d", n_levels=6)
```

- `sns.stripplot()` – like a scatterplot, but one of the variables is categorical
- `sns.stripplot(jitter=True)` – stops the points from overlapping as much
- `sns.swarmplot()` – beeswarm plot that works like `stripplot()` above, but avoids overlap entirely
- `sns.swarmplot(hue)` – set the `hue` parameter to use colour to distinguish levels of a variable – e.g. blue for male, red for female

- `sns.violinplot()` – draw a violinplot with a boxplot inside it
- `sns.violinplot(hue, split=True)` – if the `hue` variable has two levels, then you can split the violins so each half shows one level, rather than drawing symmetrical violins
- `sns.violinplot(inner='stick')` – show the individual observations inside the violin plot, rather than a boxplot
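To make this concrete, here's a minimal sketch on invented data (the `group` and `value` columns are placeholders, and the Agg backend is used so nothing needs a display):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical data: a numeric measurement split by a two-level category
df = pd.DataFrame({
    "group": np.repeat(["a", "b"], 50),
    "value": np.concatenate([np.random.randn(50), np.random.randn(50) + 1]),
})
# inner='stick' draws one line per observation inside each violin
ax = sns.violinplot(x="group", y="value", data=df, inner="stick")
```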

- `sns.barplot()` – standard barplot, complete with bootstrapped confidence intervals
- `sns.countplot()` – histogram over a categorical variable, as opposed to the regular histogram which is over a continuous variable
- `sns.pointplot()` – plot the interaction between variables using scatter plot glyphs

- `sns.factorplot()` – draw multiple plots on different facets of your data. Combines plots (like the ones above) with a FacetGrid, which is a subplot grid that comes with a range of methods.
- `sns.factorplot(kind)` – specify the type of your plot. Choose between ‘point’, ‘bar’, ‘count’, ‘box’, ‘violin’ and ‘strip’. ‘swarm’ seems to work too, at least according to the official tutorial (use a Find search to find the example).

- `sns.regplot()` – plot a scatterplot, simple linear regression line and 95% confidence intervals around the regression line. Accepts *x* and *y* variables in a variety of formats. Subset of `sns.lmplot()`.
- `sns.lmplot()` – like `sns.regplot()`, but requires a *data* parameter and the column names to plot specified as strings
- `sns.lmplot(x_jitter)` – add jitter in the x-direction. Useful when making plots where one of the variables takes discrete values.
- `sns.lmplot(x_estimator)` – instead of points, plot an estimate of central tendency (like a mean) and a range
- `sns.lmplot(order)` – fit non-linear trends with a polynomial (applies to `regplot()` too)
- `sns.lmplot(robust=True)` – fit robust regression, down-weighting the impact of outliers
- `sns.lmplot(logistic=True)` – logistic regression
- `sns.lmplot(lowess=True)` – fit a scatterplot smoother
- `sns.lmplot(hue)` – fit separate regression lines to levels of a categorical variable
- `sns.lmplot(col)` – create facets along levels of a categorical variable
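As a quick sketch (the data frame and its `x`/`y` columns are invented for illustration):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.RandomState(0)
df = pd.DataFrame({"x": rng.uniform(0, 10, 100)})
df["y"] = 2 * df["x"] + rng.randn(100)

# lmplot needs a data parameter and string column names,
# and returns a FacetGrid you can keep customising
g = sns.lmplot(x="x", y="y", data=df)
```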

- `sns.residplot()` – fits a simple linear regression, calculates residuals and then plots them
- `sns.heatmap()` – takes rectangular data and plots a heatmap
- `sns.clustermap()` – hierarchically clustered heatmap
- `sns.tsplot()` – time series plotting function. Has the option to include uncertainty, bootstrap resamples, a range of estimators and error bars.
- `sns.lvplot()` – letter value plot, which is like a better boxplot for when you have a high number of data points
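A minimal heatmap sketch – any rectangular array works, so random data stands in here:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import numpy as np
import seaborn as sns

# A 4x6 array of arbitrary values as a stand-in for real rectangular data
data = np.random.rand(4, 6)
# annot=True writes the value inside each cell
ax = sns.heatmap(data, annot=True, cmap="BuGn")
```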

- `sns.get_dataset_names()` – list all the toy datasets available in the Seaborn online repository
- `sns.load_dataset()` – load a dataset from the Seaborn online repository
- `sns.FacetGrid`, `sns.PairGrid`, `sns.JointGrid` – grids of subplots used for plotting, each somewhat different and each with their own set of methods
- `sns.despine()` – remove the top and right axes, making the plot look better

- `sns.set()` – set plotting options to the seaborn defaults. Can be used to reset plot parameters to their default values.
- `sns.set_style()` – change the default plot theme
- `sns.set_context()` – change the default plot context. Used to scale plots up and down. Options are `paper`, `notebook`, `talk` and `poster`, in order from smallest to largest scale.
- `sns.axes_style()` – temporarily set plot parameters, often used for a single plot. For example:

```python
with sns.axes_style("white"):
    sns.jointplot(x=x, y=y, kind="hex", color="k")
```

- `sns.color_palette()` – return the list of colours in the current palette
- The ‘hls’ colour palette is one option; see its colours with `sns.palplot(sns.color_palette("hls", 8))`.
- Another (better) option is the husl system; see its colours with `sns.palplot(sns.color_palette("husl", 8))`.
- Use ‘Paired’ to access ColorBrewer colours: `sns.palplot(sns.color_palette("Paired"))`. Likewise you can put in other parameters; for example, `sns.palplot(sns.color_palette("Set2", 10))` for the “Set2” palette.
- Tack on ‘_r’ to ColorBrewer palettes to reverse the colour order. Compare the difference between `sns.palplot(sns.color_palette("BuGn_r"))` and `sns.palplot(sns.color_palette("BuGn"))`.
- Tack on ‘_d’ to ColorBrewer palettes to create darker palettes than usual. See `sns.palplot(sns.color_palette("GnBu_d"))` compared to `sns.palplot(sns.color_palette("GnBu"))`.
- `sns.palplot()` – plot the colours in a palette in a horizontal array
- `sns.hls_palette()` – more customisation of the ‘hls’ palette
- `sns.husl_palette()` – more customisation of the ‘husl’ palette
- `sns.cubehelix_palette()` – more customisation of the ‘cubehelix’ palette
- `sns.light_palette()` and `sns.dark_palette()` – sequential palettes for sequential data
- `sns.diverging_palette()` – pretty self-explanatory
- `sns.choose_colorbrewer_palette()` – launch an interactive widget to help you choose ColorBrewer palettes. Must be used in a Jupyter notebook.
- `sns.choose_cubehelix_palette()` – similar to `sns.choose_colorbrewer_palette()`, but for the cubehelix colour palette
- `sns.choose_light_palette()` and `sns.choose_dark_palette()` – launch an interactive widget to aid the choice of palette
- `sns.choose_diverging_palette()` – guess what this does

**Using colour palettes**

Use the `cmap` argument to pass colour palettes to a Seaborn plotting function:

```python
x, y = np.random.multivariate_normal([0, 0], [[1, -.5], [-.5, 1]], size=300).T
cmap = sns.cubehelix_palette(light=1, as_cmap=True)
sns.kdeplot(x, y, cmap=cmap, shade=True)
```

You can also use the `set_palette()` function, which changes the default matplotlib parameters so the palette is applied to all plots:

```python
sns.set_palette("husl")
```

Besides its utility for installing and managing packages, conda also possesses the ability to create virtual environments which make sharing and reproducing analyses much easier. These virtual environments are created without any Python packages preloaded into them.

Installing Python packages into the virtual environment is often straightforward. You can use the `conda install` command to install many packages quickly and easily.

Not all packages are available with `conda install`, though, and if you want one that isn’t available then you’ll have to use the alternate package manager pip. It is not at all obvious how Anaconda’s package manager conda and pip interact with each other, particularly in the context of virtual environments.

Here is how to install packages using pip inside a conda virtual environment. First thing is to get set up:

- Create your virtual environment with `conda create --name virtual_env_name`, replacing ‘virtual_env_name’ with the name of your virtual environment
- Switch to your virtual environment with `source activate virtual_env_name`, again replacing ‘virtual_env_name’ with the name of your virtual environment
- Run `conda install pip`, which will install pip to your virtual environment directory

At this point you have two versions of `pip` installed: a global version and a version specific to your virtual environment. If you try to run `pip install package_name` you’ll use the global version, which will install the package outside your virtual environment.

You need to use the version of `pip` inside your virtual environment. To do this you need to find the directory of your virtual environment, which will be somewhere like “/anaconda/envs/virtual_env_name/”. You can then install packages into your virtual environment with `/anaconda/envs/venv_name/bin/pip install package_name`.

That’s all there is to it.

Here is my quick reference list of functions. Note that since reading written material is no substitute for repeated practice, you should not expect to remember the functions below. Better to treat this list as a cheatsheet to refer to when working through practice problems, such as the ones here.

`pd.DataFrame(x, index, colnames)` creates a pandas dataframe from some data.

For example:

```python
dates = pd.date_range('2017-06-21', '2017-06-27')
pd.DataFrame(np.random.randint(0, 10, 7), index=dates, columns=['freq'])
```

You can also create a dataframe without following this syntax. Here’s a multi-column version from a dictionary:

```python
x = {'a': np.random.randint(0, 10, 7), 'b': np.random.randint(0, 10, 7)}
pd.DataFrame(x)
```

To create a Series use `pd.Series(x, index)` – it’ll let you create a series from an array/dict/scalar.

The following are some useful dataframe functions:

- `pd.DataFrame.head()` – returns the first five rows of a dataframe
- `pd.DataFrame.tail()` – returns the last five rows of a dataframe
- `pd.DataFrame.index` – display the index of a dataframe
- `pd.DataFrame.columns` – list the columns of a dataframe
- `pd.DataFrame.dtypes` – print the data types of each column of a dataframe
- `pd.DataFrame.values` – print the values of a dataframe
- `pd.DataFrame.describe()` – summarise a dataframe: return summary statistics including the number of observations per column, the mean of each column and the standard deviation of each column
- `pd.DataFrame.info()` – brief summary of a dataframe
- `pd.DataFrame.T` – transpose a dataframe
- `pd.DataFrame.sort_index()` – sort a dataframe by its index values. Can specify the axis (colnames, rownames) and the order of sorting.
- `pd.DataFrame.sort_values('col')` – sort a dataframe by the column name *col*
- `pd.DataFrame.iloc[i]` – slice and subset your data by a numerical index
- `pd.DataFrame.loc[]` – slice and subset your data by using string(s)
- `pd.DataFrame.isin(l)` – return True or False depending on whether the item value is in the list *l*
- `pd.DataFrame.set_index(s)` – set the index of a dataframe to column name(s) *s*, where *s* can be an array of column names to create a MultiIndex
- `pd.DataFrame.swaplevel(i, j)` – swap the levels *i* and *j* in a MultiIndex
- `pd.DataFrame.drop('c1', axis=1, inplace=True)` – drop a column *c1* from a dataframe
- `pd.DataFrame.iterrows()` – a generator for iterating over the rows of a dataframe
- `pd.DataFrame.apply(f, axis)` – apply a function *f* vectorwise to a dataframe over a given axis
- `pd.DataFrame.applymap(f)` – apply a function *f* elementwise to a dataframe
- `pd.DataFrame.drop(s, axis=1)` – delete column *s* from a dataframe
- `pd.DataFrame.resample('offsetString')` – convenient way to group time series into bins. See here for details on the offset string and here for some examples.
- `pd.DataFrame.merge(df2)` – join a dataframe *df2* to another dataframe. Can specify the type of join.
- `pd.DataFrame.append(df2)` – append the dataframe *df2* to a dataframe (similar to `rbind()` in R)
- `pd.DataFrame.reset_index()` – reset the index back to the default numeric row counter
- `pd.DataFrame.idxmax()` – dataframe equivalent of the numpy `argmax` method
- `pd.DataFrame.isnull()` – indicates if values are null or not
- `pd.DataFrame.from_dict(d)` – create a dataframe from a dictionary *d*
- `pd.DataFrame.stack()` – turn column names into index labels
- `pd.DataFrame.unstack()` – turn index values into column names
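A minimal sketch exercising a few of these methods together (the city/temperature data is invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({'city': ['Sydney', 'Perth', 'Hobart'],
                   'temp': [22, 31, 14]})
df = df.set_index('city')       # use the city column as the index
df = df.sort_values('temp')     # order rows by temperature, ascending
hottest = df['temp'].idxmax()   # index label of the maximum value
```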

To group a dataframe by a column (or columns), use `pd.DataFrame.groupby('colname')`. This returns a DataFrameGroupBy object, on which you can call a certain set of methods.

Assume *gb* is a DataFrameGroupBy object returned from calling `pd.DataFrame.groupby()`. There is a basic family of functions that you can commonly call on these objects; sum, min, max, mean, median and std will all be very useful to you. Some other useful methods are:

- `gb.agg(arr)` – returns whatever functions you specify in the array *arr*
- `gb.size()` – return the number of elements in each group
- `gb.describe()` – returns summary statistics
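For example, on a small invented dataframe:

```python
import pandas as pd

df = pd.DataFrame({'group': ['a', 'a', 'b', 'b'],
                   'value': [1, 3, 2, 6]})
gb = df.groupby('group')
means = gb.mean()                 # mean of 'value' within each group
summary = gb.agg(['min', 'max'])  # several functions at once
sizes = gb.size()                 # number of rows per group
```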

The pandas library contains a module dedicated to string manipulation and string handling. This module, called *str*, operates on Series objects and is located at pd.Series.str in the pandas hierarchy of functions.

Let *s* be a Series made up of strings. Then the following are some useful methods:

- `s.str[0]` – return the first letter of each element of *s*
- `s.str.lower()` – change each element of *s* to lowercase
- `s.str.upper()` – change each element of *s* to uppercase
- `s.str.len()` – return the number of letters of each element of *s*
- `s.str.strip()` – remove whitespace around the elements of *s*
- `s.str.replace('s1', 's2')` – replace a substring *s1* with a substring *s2* for each element of *s*
- `s.str.split('s1')` – split up the elements of *s* using *s1* as a separator
- `s.str.get(i)` – extract the *i*th element of each array of *s*
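Chaining a few of these together (the names are made up):

```python
import pandas as pd

s = pd.Series([' Alice Smith ', 'BOB JONES'])
cleaned = s.str.strip().str.lower()              # trim whitespace, lowercase
first_names = cleaned.str.split(' ').str.get(0)  # first word of each element
lengths = cleaned.str.len()                      # letters per cleaned element
```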

- `pd.__version__` – return the version of pandas
- `pd.date_range()` – create a series of dates in a DatetimeIndex. Some options include:
  - a start date and an end date (e.g. `pd.date_range('2015-01-05', '2015-01-10')`)
  - a start date, an end date and a frequency (e.g. `pd.date_range('2016-01', '2016-10', freq='M')`)
  - a start date and the number of periods (e.g. `pd.date_range('2016-01', periods=10)`)
- `pd.read_csv(filepath, sep, index_col)` – read in a csv file, often from a web address or file. Specify the separator with the *sep* parameter, and the column to use as the row names of the table with the *index_col* parameter.
- `pd.value_counts()` – count how many times a value appears in a column
- `pd.crosstab()` – create a frequency table of two or more factors
- `pd.Series.map(f)` – the Series version of `applymap`
- `pd.to_datetime()` – convert something to a numpy datetime64 format
- `pd.to_numeric()` – convert something to a float format
- `pd.concat(objs)` – put together the data frames in the array *objs* along a given axis, similar to `rbind()` or `cbind()` in R
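The `axis` parameter is what toggles `pd.concat()` between its `rbind()`-like and `cbind()`-like behaviour – a quick sketch on two toy frames:

```python
import pandas as pd

df1 = pd.DataFrame({'x': [1, 2]})
df2 = pd.DataFrame({'x': [3, 4]})
stacked = pd.concat([df1, df2], ignore_index=True)  # rows on top of rows, like rbind()
side_by_side = pd.concat([df1, df2], axis=1)        # columns next to columns, like cbind()
```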

This list is by no means complete, nor does it pretend to be. It is simply a list of the functions I have encountered on my journey learning pandas.

If you create your own list, and post your list on your blog, and send me a link to your list, then we both may learn something new today.

Looking for an easy reference of useful numpy functions? Check this list out.

Open up a new script and import the numpy package:

```python
import numpy as np
```

Now cast your eye over these functions.

In no particular order:

- **np.__version__** – return the version of numpy you have loaded.
- **np.shape(x)** – return the shape of an array *x*; essentially the number of rows and columns in *x*.
- **np.ndim(x)** – return the number of dimensions of an array.
- **np.zeros(shape)** – create an array of zeros in the shape you specify.
- **np.ones(shape)** – create an array of ones in the shape you specify.
- **np.eye(n)** – create an *n* x *n* identity matrix.
- **np.arange(start, stop, step)** – create evenly spaced values that are *step* apart between a start and end value.
- **np.linspace(start, stop, num)** – create *num* evenly spaced values between a start and end value.
- **np.reshape(x, newshape)** – change the shape of *x* to *newshape*.
- **np.random.random(size)** – return *size* random numbers in [0, 1).
- **np.random.rand(d0, d1, …, dn)** – random uniformly distributed values in [0, 1), in shape (*d0*, *d1*, …, *dn*).
- **np.random.randn(d0, d1, …, dn)** – random normally distributed values from the standard normal distribution, in shape (*d0*, *d1*, …, *dn*).
- **np.random.normal(loc, scale, size)** – draw *size* random samples from a *N*(*loc*, *scale*^2) distribution.
- **np.random.randint(low, high, size)** – draw *size* random integers from a U(*low*, *high*) distribution.
- **np.pad(x)** – pads an array. Parameters determine what you pad the array with, how large the pad is and the mode of padding (there are lots!).
- **np.diag(x, k)** – construct a diagonal array, with values *x* down the diagonal *k*.
- **np.tile(x, reps)** – repeat *x* a total of *reps* times, where *reps* can be of multiple dimensions.
- **np.unravel_index(indices, dims)** – in an array of shape *dims*, what is the index of the *indices*-th element? For example, np.unravel_index(32, (3, 3, 5)) = (2, 0, 2).
- **np.dtype()** – create your own custom data types.
- **np.dot(A, B)** – find the dot product of two matrices *A* and *B*.
- **np.ndarray.astype(dtype)** – change the data type of an array while making a copy of it.
- **np.ceil(x)** – rounds decimal numbers up to the nearest integer.
- **np.floor(x)** – rounds decimal numbers down to the nearest integer.
- **np.copysign(x1, x2)** – changes the sign of elements in array *x1* to that of elements in array *x2*, comparing element-wise.
- **np.intersect1d(x1, x2)** – find the intersection of arrays *x1* and *x2*, returning an ordered set.
- **np.union1d(x1, x2)** – find the union of arrays *x1* and *x2*, returning an ordered set.
- **np.datetime64('s1')** – convert a string *s1* to a numpy datetime.
- **np.timedelta64('s1')** – convert a string *s1* to a numpy timedelta, with which you can perform date arithmetic.
- **np.arange('s1', 's2', dtype='datetime64[D]')** – get a list of days between two dates *s1* and *s2*.
- **np.add(x1, x2, out)** – add two arrays *x1* and *x2*. If *out* equals *x1*, then *x1* will be overwritten with the result of the addition. Same thing for np.multiply, np.divide, np.negative.
- **np.trunc(x)** – get rid of the decimal points in a floating point array *x*, leaving just the integer components.
- **np.sort(x)** – sort an array *x* in ascending order.
- **np.sum(x, axis)** – return the sum of an array *x* over a particular axis.
- **np.add.reduce(x, axis)** – a quicker way of finding the sum of an array *x* over a particular axis, for small *x*. This is an example of a ufunc.
- **np.array_equal(x1, x2)** – check to see if two arrays *x1* and *x2* are equal.
- **np.meshgrid(x1, x2)** – create a 2D rectangular grid of values from arrays *x1* and *x2*. See here for further explanation.
- **np.outer(x, y)** – calculate the outer product of two vectors *x* and *y*.
- **np.set_printoptions(threshold)** – change the number of elements displayed when printing an array to the console.
- **np.argmax(x)** – return the indices of the maximum values along an axis for an array *x*.
- **np.argmin(x)** – return the indices of the minimum values along an axis for an array *x*.
- **np.put(x, ind, v)** – put values *v* into an array *x* at indices *ind*, replacing what was there before.
- **np.argsort(x)** – return the indices that would sort an array *x*. See here for further explanation.
- **np.any(x, axis)** – test if any array element of *x* along a given axis evaluates to True.
- **np.ndarray.flat** – a flat iterator object to iterate over arrays. Can be indexed with square brackets.
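To see a few of these working together, here's a small worked sketch (the array values are just 0–11 so the results are easy to check by hand):

```python
import numpy as np

x = np.arange(12).reshape(3, 4)  # 0..11 laid out in 3 rows of 4
col_sums = np.sum(x, axis=0)     # sum down each column
flat_pos = np.argmax(x)          # position of the max in the flattened array
pos = np.unravel_index(flat_pos, x.shape)  # that position as a (row, col) pair
```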

Knowing how to use these functions will give you a good starting base for your numpy adventures. Good luck!

You own a van and you drive around places. This fine day you’re out looking for broken telephone poles in a suburb.

You drive around and find twenty telephone poles. Out of these twenty, you reckon three of them are dodgy, but it’s 3pm and you don’t have time to fix them today.

It’s important for you to be able to find the dodgy poles later, so while you’re at each one you take out your GPS unit and take a reading. Before you clock off for today, you check the history of the GPS unit and write down the three pairs of numbers there:

(151.2092, -33.8684)
(151.2010, -33.8700)
(151.2103, -33.8649)

You sleep easy because you know that with these coordinates written down, you’ll be able to find the poles whenever you like. You know that the coordinates represent longitude and latitude, and that if you plot them on a map you’ll be able to see where the poles are with no dramas.

What you might not realise is that you used a coordinate reference system to find the location of those poles. Coordinate reference systems are used to represent the locations of things on the Earth, and your GPS receiver works by utilising one.

By unwittingly utilising the coordinate reference system underpinning your GPS receiver, you can easily locate which telegraph poles need fixing.

Nifty.

A Coordinate Reference System (CRS) is used to uniquely identify the location of things relative to the Earth. It also goes by the name Spatial Reference System (SRS).

Think of the Cartesian plane you studied at school: when you saw a set of coordinates of a point, you instantly knew the exact location of that point. It’s the same thing in the geographical world, just more complicated.

The spatial data you find will be created by using some CRS. Points, lines, polygons, raster sheets – all of this data has to refer to a CRS. It would be meaningless to call it spatial data otherwise.

CRS’s usually come in two categories. These are geographic and projected, and they are heavily related to one another.

In a nutshell: geographic Coordinate Reference Systems define locations using a 3D surface, and measure location with latitude and longitude.

There seems to be a lot of ambiguity surrounding what a geographic CRS is and what it contains. Many sources disagree on the topic, and the terminology used is often unclear. To keep things consistent I’ll be going by the definitions found on the EPSG website.

With that being said, a geographic CRS is made up of two things:

- A coordinate system
- A datum

The terms “coordinate system” and “coordinate reference system” are often treated as the same thing, but in this context they are different. Here, a coordinate system is a set of axes together with their properties (axis names, order, abbreviations etc), contained within a CRS.

Coordinate systems are classified according to the geometric properties of their coordinate space and of the shape of their axes. They come in a few different forms. Here are a few of them:

- **Cartesian coordinate system:** position is given assuming the axes are orthogonal and straight. Each axis is measured in the same units.
- **Ellipsoidal coordinate system:** position is given by latitude, longitude, and optionally height. This coordinate system is the one used in geographic CRS’s.
- **Vertical coordinate system:** a one-dimensional coordinate system that records the heights of points above the Earth’s surface.

A datum is used to define a few things: the position of the origin, the scale and the axis orientation of a coordinate system. All these things are defined with respect to an object, which is typically the Earth.

Datums come in different forms. Here are two of them:

- **Geodetic datum:** defines the model of the Earth to use when calculating coordinates. The model of the Earth is usually an ellipsoid or a sphere. Also contained in the datum is the location and orientation of the model.
- **Vertical datum:** describes a reference level surface, also known as the “zero-height” surface. The position of the zero-height surface with respect to the Earth is also described.

Coordinates only make sense when considered in conjunction with a model of the Earth. The same location on Earth under two different Earth models will be represented by two different sets of coordinates.

One example of a geographic CRS is WGS 84. WGS 84 goes by the EPSG code EPSG 4326 and is one of the most important geographic CRS’s. It is the geographic CRS of choice for the omnipresent GPS system.

(Don’t know how EPSG codes work? Click here.)

Like other geographic CRS’s, WGS 84 is made up of a coordinate system and a datum.

Here is a diagram of the EPSG codes for WGS 84:

The coordinate system of WGS 84 is an ellipsoidal coordinate system. In this coordinate system position is given by latitude and longitude. The north direction on the latitude axis is taken as positive, and likewise the east direction on the longitude axis is taken as positive.

The datum of WGS 84 is a geodetic datum, going under the EPSG code 6326. This datum models the Earth as an ellipsoid with a semi-major axis radius of 6,378,137m, and an inverse flattening of 298.257223563 (EPSG 7030). The Prime Meridian defines the location and origin of the model. The Prime Meridian is defined as running through Greenwich (EPSG 8901).

This CRS is defined as a 2D CRS, since the ellipsoidal height is not provided in the datum. If it were given, WGS 84 would be a 3D CRS instead.

The EPSG database also defines the area of use for a geographic CRS. While WGS 84 is suitable to be used across the world (EPSG 1262), other geographic CRS’s are only suitable in certain areas.

Projected CRS’s can be thought of as the two-dimensional cousin of the three-dimensional geographic CRS. The geographic CRS represents data using a three-dimensional construct; the projected CRS uses projections to transform points from the three-dimensional construct to a two-dimensional map.

Simply put, a projection is a series of transformations that convert the locations of points on a three dimensional surface (defined in the geographic CRS) to locations on a flat surface (defined in the projected CRS).

A projected CRS is made up of three things:

- A geographic CRS
- A coordinate system
- A map projection

A geographic CRS uses an ellipsoidal coordinate system; a projected CRS uses a Cartesian coordinate system. A geographic CRS uses latitude, longitude and degrees; a projected CRS uses northings, eastings and metres (or feet, kilometres etc).

It is difficult to represent a three-dimensional Earth as a two-dimensional map. To mitigate these difficulties a vast number of map projections exist, each with their own strengths and weaknesses.

One way of thinking about a map projection is as a way of converting geographic coordinates (latitude and longitude) into Cartesian coordinates (and vice versa). You could also say that a map projection converts coordinates referenced in a geographic CRS to coordinates referenced in a projected CRS.

For a successful projection you need more than just the name of the map projection. You need projection parameters. These parameters answer questions like:

- Where is the centre of the projection?
- What is the scale factor at each point?
- Where are the standard parallels?
- What are the values of the false easting and false northing?

Let’s look at an example.

The NAD83 / Canada Atlas Lambert (EPSG 3978) projection is a projected CRS. This projected CRS uses the geographic CRS known as NAD83 and the Lambert Conic Conformal map projection. The parameters are also provided, including specifying that this conical projection uses two Standard Parallels.

Here’s the diagram of the EPSG codes for this projected CRS. It’s quite complicated.

Under the gold-standard EPSG system, a map projection is a subcategory of a “Coordinate Conversion”, or a Conversion for short. Included within the Conversion is a category that details the area of the projection, the conversion method of the projection and the projection parameters.

It’s evident that the projected CRS uses a Cartesian coordinate system, a geographic CRS and a map projection. You can see that there are two coordinate systems referenced here: the ellipsoidal coordinate system used in the geographic CRS, and the Cartesian coordinate system used in the projected CRS.

For many applications of spatial data you won’t need to think about which coordinate reference system it’s referenced in. You’ll put together the layers of data and things will just work.

It’s when things go wrong that you’ll need to dig deeper. Perhaps layers aren’t aligning properly, lakes don’t look right or the positions of your data points look suspicious – maybe the error can be traced back to a CRS mismatch somewhere.

Or maybe you can’t find those broken telegraph poles again. Probably should have taken a photo.

Or at least, it used to be. It was absorbed by IOGP (International Association of Oil & Gas Producers) in 2005 and ceased operation as an independent body.

We don’t remember EPSG for its history as an entity, but rather for the database that it compiled. EPSG created the EPSG Geodetic Parameter Set – a comprehensive database of coordinate reference systems, datums, ellipsoids, and other such geodetic parameters. Although EPSG as an organisation is no more, its database lives on and is still updated and maintained.

Each entry in the EPSG database has a unique code associated with it. These codes are known as EPSG codes and they seem to be found everywhere that spatial data lurks.

An EPSG code might refer to a:

- Coordinate Reference System (CRS) – like EPSG 4326, which refers to the coordinate reference system WGS 84.
- Datum – like EPSG 6326, which refers to the datum used in the coordinate reference system WGS 84.
- Area of use – like EPSG 1262, which refers to the entire world.
- Prime Meridian – like EPSG 8901, which refers to the meridian passing through Greenwich.

This is not a comprehensive list. EPSG codes also refer to ellipsoids, spheroids and miscellaneous other things.

For further illustration here is the EPSG structure for WGS 84, a commonly used geographic Coordinate Reference System:

Evidently, many EPSG codes together make up the WGS 84 coordinate reference system. The name WGS 84 appears three times in this hierarchy (under the entries for Geodetic CRS, Geodetic Datum and Ellipsoid), so the name on its own is ambiguous. The code EPSG 4326, by contrast, is unambiguous: it refers only to the coordinate reference system, not the ellipsoid or the geodetic datum.

Next time you meet an EPSG code in the wild, you’ll be prepared.

Ideally we wouldn’t have to use maps because we’d use globes for everything. But globes aren’t convenient. Flat maps can be viewed on computer screens, printed out, displayed on a wall, rolled up into a scroll, and are simply much more useful and convenient than globes are.

There’s another problem with globes: they’re only practical at small scales. If you wanted to illustrate spatial data with suburb-level detail, you’d need a massive globe!

Two dimensional maps it is then. How do we make one? Through something called a *projection*.

Suppose you had a transparent globe with a light bulb at its centre, and a big sheet of paper. You could wrap the paper around the globe so that the countries and features of the globe are projected onto the paper. Then you could trace around the countries with a pencil to give yourself a two-dimensional map of the world.

How you wrap the paper around the globe makes a big difference. There are three families of map projection – cylindrical, conical and planar – and each refers to a different way of surrounding the globe with the paper:

Unfortunately, every projection is distorted somehow – there’s no such thing as a perfect projection. Projections distort area, shape, distance or direction. You can make one of these accurate, but then you must compromise on the others.

There exist many, many projections, most of which very few people have heard of. This tool is an interesting way to explore some of them.

Here are some of the more common ones:

The Mercator projection is a cylindrical projection instantly recognisable as the “default” map projection, for better or for worse. While it is highly accurate at the equator, areas of countries towards the north and south poles are much larger than they should be.

A side effect of this projection being so prevalent is that people are often surprised when they see how big Africa really is. Being placed in the centre of the projection means that Africa is less affected by the significant vertical distortions affecting the likes of Russia and Greenland. Other projections represent Africa’s relative size a lot better than the Mercator projection does.
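The distortion can be quantified: at latitude φ, the Mercator projection stretches distances by a factor of 1/cos(φ) relative to the equator, so areas are inflated by the square of that factor. A quick back-of-the-envelope sketch in Python (the latitude used for Greenland is an illustrative round number):

```python
import math

def mercator_scale(lat_degrees):
    """Scale factor of the Mercator projection at a given latitude.

    Distances on the map are stretched by sec(latitude) relative to
    the equator, so areas are inflated by the square of this factor.
    """
    return 1.0 / math.cos(math.radians(lat_degrees))

# At the equator there is no stretching at all.
print(round(mercator_scale(0), 2))  # 1.0

# Around 72 degrees north (central Greenland), lengths are stretched
# more than threefold, so areas appear roughly ten times too large.
k = mercator_scale(72)
print(round(k, 2), round(k**2, 2))  # 3.24 10.47
```

This is why Greenland looks comparable to Africa on a Mercator map despite being around a fourteenth of its area.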

Many online maps use the Mercator projection, including those made by Google, Bing, OpenStreetMap and Yahoo. Since these services mostly provide local maps rather than global ones, it doesn’t matter to their users that Greenland looks as big as Africa. Users care more that the map is conformal: the scale is the same in all directions, angles are depicted correctly, and real-life circles are depicted as circles on the map. In addition, the Mercator projection has a constant north direction wherever you are in the world.

Having said all this, the implementation by Google, Bing etc has some problems. This article explains these in some detail.

A compromise projection, the Robinson projection was created to “look right” rather than to measure distances exactly. It manages to fit the entire globe onto a flat surface with far less area distortion than the Mercator projection. The Robinson projection is a pseudocylindrical projection.

The meridians on this projection curve gently towards the poles, which avoids an extreme distortion of shape. The flipside is that the poles become a line rather than a point – just look at Antarctica dominating the southern hemisphere.

The Robinson projection has a unique history – unlike other projections, it wasn’t derived from a mathematical model. Rather, it was constructed through computer simulations and trial and error, with a mathematical model devised afterwards to reproduce the end result.

A variation of the Mercator projection, the transverse Mercator projection has a variable central meridian – you can place it along a number of meridians. Typically you’d place it at the location you’re interested in mapping, since this projection is highly accurate within about 5 degrees of its central meridian.

Both the Mercator projection and the transverse Mercator projection are very accurate in the middle of their mapping region. This makes them good choices for local mapping.

Like the regular Mercator projection, the transverse Mercator projection shouldn’t be used for showing the whole globe at once. The two have similar difficulties in mapping areas far away from the central meridian accurately. Using the transverse Mercator projection won’t fix the large distortion in areas at the boundaries of the map.

One difference between the two is where the distortion occurs. The regular Mercator projection has its area distortion on the extremes of the y-axis. The transverse Mercator projection, by contrast, is distorted heavily on the extremes of the x-axis.

The transverse Mercator projection is the basis for the Universal Transverse Mercator (UTM) mapping system. This system divides up the globe into many narrow longitude bands, and then applies the transverse Mercator projection with the central meridian located at the centre of each band. In this way the UTM mapping system isn’t a single map projection, but rather a number of them, and the location of features has to be specified relative to which longitude band it’s in.

One thing about the UTM system is that each longitude band uses a two-dimensional Cartesian coordinate system, rather than the more familiar degrees system. This can be confusing to newcomers.
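As a rough sketch of how the longitude bands work (ignoring the handful of special zone exceptions around Norway and Svalbard), each zone is 6 degrees of longitude wide:

```python
def utm_zone(longitude):
    """UTM zone number for a longitude in degrees.

    The globe is split into 60 bands, each 6 degrees wide,
    with zone 1 starting at 180 degrees west.
    """
    return int((longitude + 180) // 6) + 1

def central_meridian(zone):
    """Central meridian (in degrees) of a UTM zone."""
    return zone * 6 - 183

# Sydney sits near 151 degrees east, which falls in zone 56,
# whose central meridian is at 153 degrees east.
print(utm_zone(151), central_meridian(utm_zone(151)))  # 56 153
```

A transverse Mercator projection centred on that meridian then gives accurate coordinates for the whole band, which is why each zone gets its own Cartesian coordinate system.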

Place a cone over the Earth, and then project its surface onto the cone so that the angles of landmasses are preserved (this is called a *conformal* projection). This is the basis of the Lambert Conformal Conic projection.

The circles of latitude (or *parallels*) on the Earth that touch the cone are known as the *standard parallels*, and at these parallels the scale factor is 1. The Lambert Conformal Conic projection comes in either *tangent* or *secant* forms, which means that the cone will touch either one or two circles of latitude respectively.

The diagram below refers to the secant form of the projection.

If the secant form is used, we have two standard parallels. Between the two parallels the scale factor decreases, and outside the two circles of latitude the scale factor increases.

The Lambert Conformal Conic is a conformal projection, meaning that angles are preserved. The result of this is that area distortion is minimal near the two standard parallels, but increases as you move further away from them.

The Lambert Conformal Conic projection is primarily used by pilots and others who want maps that are accurate over large east-west distances. Its drawbacks mean that it isn’t used for world maps, but rather for specific mapping applications.

One of the core problems in reinforcement learning is the multi-armed bandit problem. This problem has been well studied and is commonly used to explore the tradeoff between exploration and exploitation integral to reinforcement learning.

To illustrate this tradeoff and to visualise different ways of solving the multi-armed bandit problem, I created a simulation using the JavaScript library D3.js.

Click on the below image to view it!

Given a number of options to choose between, the multi-armed bandit problem describes how to choose the best option when you don’t know much about any of them.

You are faced repeatedly with *n* choices, of which you must choose one. After your choice, you are again faced with *n* choices of which you must choose one, and so on.

After each choice, you receive a numerical reward chosen from a probability distribution that corresponds to your choice. You don’t know what the probability distribution is for that choice before choosing it, but after you have picked it a few times you will start to get an idea of its underlying probability distribution (unless it follows an extreme value distribution, I guess).

The aim is to maximise your total reward over a given number of selections.

One analogy for this problem is this: you are placed in a room with a number of slot machines, and each slot machine when played will spit out a reward sampled from its probability distribution. Your aim is to maximise your total reward.

If you like, here are three more explanations of the multi-armed bandit problem:

- This article comes in two parts. The first part describes the problem and the second part describes a Bayesian solution.
- The Wikipedia explanation
- A more mathematical introduction

There are many strategies for solving the multi-armed bandit problem.

One class of strategies is known as semi-uniform strategies. These strategies always choose the best slot machine except for a set percentage of the time, where they choose a random slot machine.

Three of these strategies can be easily explored with the aid of the simulation:

**Epsilon-greedy strategy:** The best slot machine is chosen with probability 1 − ε, and a random slot machine is chosen with probability ε. Implement this by leaving the epsilon slider in one place during a simulation run.

**Epsilon-first strategy:** Choose randomly for the first *k* trials, and then after that choose only the best slot machine. To implement this start the simulation with the epsilon slider at 1, then drag it to 0 at some point during the simulation.

**Epsilon-decreasing strategy:** The chance ε of choosing a slot machine at random decreases over the course of the simulation. Implement this by slowly dragging the epsilon slider towards 0 while the simulation is being run. You can use the arrow keys to decrement it by constant amounts.
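If you’d like to experiment outside the simulation, here is a minimal epsilon-greedy sketch in Python – the arm means, reward distributions and parameters are made up for illustration:

```python
import random

def epsilon_greedy(true_means, epsilon, n_trials, seed=0):
    """Run an epsilon-greedy agent against Gaussian-reward slot machines.

    With probability epsilon the agent explores (picks a random arm);
    otherwise it exploits the arm with the highest estimated mean reward.
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms        # times each arm was pulled
    estimates = [0.0] * n_arms   # running mean reward per arm
    total_reward = 0.0

    for _ in range(n_trials):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                            # explore
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])   # exploit
        reward = rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        # Incremental update of the running mean for the chosen arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward

    return total_reward, counts

total, counts = epsilon_greedy([1.0, 2.0, 3.0], epsilon=0.1, n_trials=2000)
# The best arm (true mean 3.0) should receive the bulk of the pulls.
print(counts)
```

Replacing the fixed epsilon with one that decays over the trials turns this into the epsilon-decreasing strategy.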

There also exist other classes of strategies for solving the multi-armed bandit problem, such as probability matching strategies, pricing strategies and strategies particular to the domain of the application. There are also different variants of the multi-armed bandit problem, including non-stationary variants and contextual variants.

I hope you find the simulation useful. Happy banditing!

I found that when I read about these forces, text-based explanations never really solidified the concepts for me. To really get a sense of how the forces worked, I had to build lots of force graphs and tweak the parameters manually to see what each of them did.

So I got really excited when I discovered this testing ground for force directed graphs, built especially for version 4 of D3:

Made by Steve Haroz, it’s a brilliantly simple way to allow you to experiment quickly and easily with all the different settings available for force graphs. Use it to develop an intuitive sense of what each force does and how each of the force parameters affects the final result.

Enjoy!

Hope you found that useful! Click here to view the rest of the force directed graph series

Spatial data is data that has a spatial component – the data took place *somewhere*, and that *somewhere* is important. The data is linked intricately to a place on Earth, and that place is relevant and important and you should care about its welfare.

Spatial data is called many different things. It can also be referred to as geospatial data, or as geographic information, or sometimes as spatial information.

Spatial data usually comes in two formats: raster and vector.

The raster format stores data in a grid consisting of rows and columns of cells. Think pixels on a computer screen, or art made out of Lego. That’s what raster data looks like.

Every cell has a value. There are no empty cells. Even if a cell has the value 0, it still has a value.

Many image formats can be used for raster data, like GIF, TIF and JPEG files. But to be usable, they’ve got to have reference information associated with the image to specify where on Earth the image is located. This process of taking an image and associating location information with it is called georeferencing.

Raster data is typically useful for things like weather cover, vegetation growth, or other things where satellite imagery comes in handy.

Vector data is made of geometrical shapes. These shapes are also known as vector objects and are used to represent the location of features on the Earth.

Three common types of geometrical shapes are points, lines and polygons. Points are used for things like mountain peaks and wells – things that you can represent really well by just a dot on a map. Lines are used for things like rivers, roads, train tracks and property boundaries. Polygons are used for lakes, buildings, property areas, forests, and other things that you’d want to represent the area of.

You might also encounter polylines on your vectorial adventures. Polylines sound scary, but they’re just a collection of straight lines joined end to end. No more.

Attribute data can be associated with each geometrical shape. Cities could have as attributes their name, their population, or the number of buildings they contain. A lake could have as attributes water colour, depth and salinity.
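GeoJSON is one common vector format that shows this pairing of geometry and attributes clearly. Here is a hypothetical lake as a GeoJSON feature – all names and values are invented for illustration:

```python
import json

# A made-up lake represented as a GeoJSON feature: the geometry is a
# polygon, and the attribute data lives in the "properties" field.
lake = {
    "type": "Feature",
    "geometry": {
        "type": "Polygon",
        "coordinates": [[
            [144.95, -37.80], [144.97, -37.80],
            [144.97, -37.82], [144.95, -37.82],
            [144.95, -37.80],  # the first and last points close the ring
        ]],
    },
    "properties": {
        "name": "Example Lake",
        "depth_m": 12.5,
        "salinity": "fresh",
    },
}

print(json.dumps(lake["properties"]))
```

The geometry says where the lake is; the properties say everything else you know about it.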

Unlike raster data, empty spaces are allowed in vector data. The vectors show where features are present and the space around the features is empty.

Vector and raster data are very different. But you knew that already.

One difference between the two is in speed. Raster data is quicker to process than vector data. But it’s likely that your computer is fast enough to work with either type quickly, so maybe you don’t care about this.

Vector data is more compact than raster data – the files will be smaller. Your hard drive is big enough that you probably don’t care about this either.

But then you find that your raster data resolution isn’t enough for what you want to do with it, so you go and double it. Then you notice that your data file has just quadrupled in size. Maybe you do care about file size after all.
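The quadrupling is just arithmetic: halving the cell size doubles the cell count along each axis, so the total cell count (and, roughly, the file size) goes up fourfold. The area and resolutions below are made-up round numbers:

```python
def cell_count(width_m, height_m, resolution_m):
    """Number of raster cells needed to cover an area at a given cell size."""
    return (width_m // resolution_m) * (height_m // resolution_m)

# Covering a 10 km x 10 km area at 10 m resolution...
coarse = cell_count(10_000, 10_000, 10)
# ...then halving the cell size ("doubling the resolution").
fine = cell_count(10_000, 10_000, 5)

print(fine // coarse)  # 4: double the resolution, quadruple the cells
```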

Vector data is more intuitive than raster data, and it can support topological relationships between features. Because of this it’s probably a friendlier format for spatial analysis. As a bonus it’s easy to identify similar areas on your map, like areas with the same temperature or areas with the same elevation.

Ultimately, you often just won’t get a choice. The data you want might only be available in one of the two formats, so it’s important to know how to work with each.
