There are a number of ways to move or copy a file using Alteryx. I'm going to showcase two approaches and talk about their benefits and drawbacks.
Run Command Tool
The first approach uses the Run Command tool, which runs external programs from within Alteryx. In this example it uses a batch file to move an Excel file.
For this approach to work you need the full path of the file you want to move and the path of the destination folder.
Next, within a Formula tool, you generate the command to be executed, which in this case is fairly simple: a move command with the file paths.

'move "' + [FullPath] + '" "' + [DestinationDirectory] + '"'
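Outside Alteryx, the same string construction can be sketched in Python (the function name is illustrative; the field names match the formula above):

```python
def build_move_command(full_path, destination_directory):
    # Mirrors the Alteryx formula:
    # 'move "' + [FullPath] + '" "' + [DestinationDirectory] + '"'
    # Quoting both paths keeps spaces in folder or file names intact.
    return 'move "' + full_path + '" "' + destination_directory + '"'

print(build_move_command(r"C:\Reports\sales.xlsx", r"C:\Archive"))
# → move "C:\Reports\sales.xlsx" "C:\Archive"
```

Each row's command string is then written out to a batch file for the Run Command tool to execute.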
In the configuration of the Run Command Tool, we want to output our command to a Batch File so within Alteryx when we hit run it executes.
The result of this setup is that a file is moved from one folder to another, which is useful if you want to archive files after processing them. However, there is one drawback to using the "move" command in this setup: if the file is open, the approach fails and produces an error in Alteryx.
Run Command Tool – Copy
A way to deal with errors when the file is open is to change the command to "copy". This will copy the file even if it's open and place the copy into our destination folder.
As we are copying the file, a version of it will still be left in the original folder. However, this can be dealt with by another Run Command tool that deletes files. Alteryx has the option in the workflow configuration to run events before or after a workflow, so you could easily have a run command to delete files in a directory before or after running.
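For intuition, the copy-then-delete pattern described above can be sketched outside Alteryx with Python's standard library (paths and the function name are illustrative):

```python
import os
import shutil

def copy_then_delete(src, dest_dir):
    # Equivalent in spirit to running "copy" followed by a delete event:
    # copy the file to the destination, then remove the original.
    dest = shutil.copy2(src, dest_dir)  # copy2 also preserves timestamps
    os.remove(src)                      # clean-up step, akin to the delete event
    return dest
```
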
Benefits – Drawbacks Run Command
The Run Command tool is powerful and opens up a number of possibilities, and as shown it can be configured to deal with different scenarios. However, because it is so powerful and can be used to run external command programs, it is typically seen as a potential IT security risk. Therefore, any production process that uses this approach would normally need sign-off from the IT security team. The approach also has a learning curve, as users may feel daunted by coding a run command.
The second approach you can use to copy files to a new location is the Blob tools in Alteryx. Essentially, you feed the input path of the file into a Blob Input tool, which ingests the file and converts it to a Blob data type. Then you feed the output file path, along with the blob from the Blob Input tool, into a Blob Output tool. This mimics the Run Command tool approach of copying a file.
For the Blob Approach you will need the full path of the Input and Output location with the file and correct file extension.
The Blob Input tool configuration is relatively simple: you will need a File Name, change "Modify File Name By" to "Replacing Entire Path with Field", and select your field in the drop-down.
The Blob Output tool follows a similar configuration, but it requires a Blob field to work. Since we are using the Blob Input tool to convert our file to a Blob, we have this covered.
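Conceptually, the Blob Input/Blob Output pair is just a binary read followed by a binary write; a minimal sketch in Python (function and variable names are illustrative):

```python
def blob_copy(input_path, output_path):
    # Blob Input: read the entire file as a binary blob
    with open(input_path, "rb") as f:
        blob = f.read()
    # Blob Output: write the blob to the destination path
    with open(output_path, "wb") as f:
        f.write(blob)
```

Because the file is read and written byte for byte, the copy works for any file type, provided the output path carries the correct extension.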
Benefits – Drawbacks Blob
The Blob approach is simpler to set up and learn than the Run Command approach. However, while it mimics the copy behaviour of the Run Command tool, unlike that approach it will error if the file is open. It should be said that an IT security team would probably be less concerned with the Blob approach: copying a file is the only thing it can do in this scenario, and it is not utilising an external command program to accomplish the task, so there is less risk of the process being used with malicious intent.
In summary these are just two approaches to moving/copying files with Alteryx. As you can see there are benefits and drawbacks to consider when adopting these approaches.
Hope you find this blog helpful and can see that you can use Alteryx for a variety of scenarios. Please leave a comment below or reach out to me on Twitter @JosephSerpis or on the Alteryx Community and Linkedin.
I was really fortunate to be one of the few to receive an early preview of this awesome book before it was released in July of this year. What I love about this book, is that it pulls together a lot of different techniques around the topic of data engineering. It also takes those techniques from a beginner view point right through to the advanced concepts, with great examples along the way. It teaches Alteryx users DataOps and it also helps engineers learn Alteryx, so whoever you are you can really gain a lot from this book.
Why is this book needed in the marketplace and what has Paul brought to the table?
Alteryx users tend to be people who are data driven and end up becoming data engineers alongside their day job, or Alteryx grows within the business such that data engineering departments are born to enable the growth of data within the organisation. Paul has clearly seen both scenarios take place over his years of using and sharing Alteryx, and has really thought about all the techniques which could help individuals and teams deliver great results within the platform.
It’s not just one book it’s THREE in ONE!
Data Engineering with Alteryx has been split into three parts, Introduction, Functional Steps in DataOps and Governance of DataOps.
Of course, Paul introduces the reader to Alteryx Designer, Server and Connect, but what is great is that he also introduces the In-DB tools within the first few pages. This is an area of Alteryx which isn't utilised enough! The more Alteryx users understand what these tools bring to the table the better: they let you utilise the power of your own databases before the data even hits Alteryx. Your database servers are often powerhouses of processing power and memory, and the In-DB tools put that power in your hands.
He then goes on to define a data engineer. At the time I was leading a data engineering team within Avon and, boy, does he nail that definition.
“As a data engineer, you are an enabler.”
We were enablers, we drove change, made people’s lives easier and enabled the business to do things they were not able to do without the help of Alteryx and a proactive team of engineers. This simple sentence made me smile.
What I really enjoyed about the introduction section is that Paul really took the time to explain DataOps and putting that People pillar first is absolutely key. I tend to think of a more traditional approach of People, Process and Technology, which is a very systematic view; with the use of Agile methodology as a method of delivering processes and technology. Paul really focuses on running a data engineering team and the elements needed here around People, Delivery and Confidence. It was a great read.
Now for the practical part; the best part of Alteryx is that it's practical, hands-on learning.
Part two of the book covers functional steps of building data pipelines and delivering them to databases as well as advanced analytic processes such as spatial analytics and machine learning models.
With graphical images and clear descriptions, Paul guides the user through the latest version of Alteryx, showing how to get the most from their data connections, and even details what an API is and how to utilise APIs for data mining.
The simple stuff: most users don't have a computer science background, so the fact that Paul describes concepts like UNC paths and relative paths, and why we use them, shows what a conscientious writer he is. Great job Paul. 😉
Throughout this whole section there are great examples, guides, graphics, key terms, code snippets, etc. that would guide both the beginner and the expert through key features that are useful in data engineering.
Now for the final section of the book, part three, Governance of DataOps, which is where the real gems are.
It takes most users a while to discover some of the tools and techniques mentioned here, but it's great that they are the first things Paul brings to light. He also pays great respect to the CREW macros, developed by Adam Riley of Alteryx but loved and looked after by Mark Frisch (MarqueeCReW), Alteryx ACE, community leader and all-round good egg.
There is a huge amount of detail around the Alteryx Usage Report (how to set it up), accessing the data within the MongoDB (database of Alteryx Server), using Git and GitHub actions for development (continuous integration) all of which are very topical right now within organisations.
This section enables Alteryx users to develop into production. It also helps you understand how users are utilising the server or apps, so you can redevelop items, stop maintaining those no longer required, and review errors for redevelopment. For those new to Alteryx Server, there is a whole chapter that explains governance and setting it up, which will make getting started a breeze. This book opens up all the fundamentals for users to get the most out of Alteryx within the field of data engineering. I particularly loved the section on Git and validating XML with Python. (But that is the developer in me 😀)
Overall, this is a book I would buy and have on my shelf to share with those I am introducing to the world of Alteryx, as a kick starter guide to developing data in the Alteryx platform. It is a perfect book for those starting out and those wanting to become more confident across the whole platform who may have only had exposure to Alteryx Designer before Alteryx Server.
You need a copy of this book if you are:
Leading an Alteryx team – for development of your team
Alteryx Champion or a budding Alteryx Champion – for development of your platform knowledge
Brand new Alteryx user – for development of yourself
If you like giving geeky gifts, this will be gratefully received by all who receive it. You have my permission. 😉
Alteryx is amazing at processing data fast, especially when performing spatial analysis. Let's take a fairly common scenario of analysing points that are within a trade area or polygon. The rationale for such analysis can be numerous, such as looking at how many potential customers are in the vicinity of a store, or determining how many people can be served within a drivetime or distance.
Alteryx is more than capable of serving this function; the image below is a mock-up of said approach. I have two inputs: one is a SQL table of postcodes with latitude, longitude and a spatial object, and the other is a Map Input of a point that I use later in the approach to create a 1 km trade area/polygon. I use a Spatial Match tool to determine how many of my postcodes are within the polygon.
The answer to the question is that 5413 postcodes sit within my 1 km trade area, and it took 18.7 seconds. I should caveat that hardware does play a factor, so the time may increase or decrease depending on your machine. However, this is still a fast time, as spatial analysis is well known to be a compute-intensive task.
It might seem unlikely that this time could be improved, but it can. The biggest drawback of the previous approach, approach A, is that all the postcodes in the SQL table are brought down from SQL into the memory of my machine. This takes time and resources. A more efficient and faster approach uses the Dynamic Input tool; it takes only 5 seconds and achieves the same result of 5413 postcodes within a 1 km trade area.
How is this achieved, you may ask? The answer lies within the Dynamic Input tool. One of the options in its configuration is to use a Spatial Filter. The requirements to use this function are, first, a polygon, which I have in my workflow, and second, a latitude and longitude field in the SQL table you are using.
What the Spatial Filter does is use a polygon from the incoming data stream to analyse whether the latitude and longitude values in the SQL table are contained within the polygon. Essentially, we are pushing the analysis onto the SQL database and pulling back only the results we need. Hence we see a nearly 14-second difference to achieve the same result, as we are not first pulling all postcodes from the table and then performing the analysis.
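The push-down idea can be illustrated with a bounding-box filter derived from the polygon's extent. This is only a sketch of the concept; the SQL Alteryx actually generates may differ, and the table and column names here are assumptions:

```python
def spatial_filter_sql(table, min_lat, max_lat, min_lon, max_lon):
    # Filter on the database side so only candidate rows inside the
    # polygon's bounding box are pulled back into Alteryx.
    return (
        f"SELECT * FROM {table} "
        f"WHERE Latitude BETWEEN {min_lat} AND {max_lat} "
        f"AND Longitude BETWEEN {min_lon} AND {max_lon}"
    )

print(spatial_filter_sql("Postcodes", 51.49, 51.51, -0.14, -0.12))
```

Filtering at the source is what saves the time: the database does the heavy lifting and only the candidate rows travel over the wire.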
Approach B only works if the data is in a database that supports spatial functions, such as SQL Server. However, given that a significant portion of big data is spatial data, it is typical for this data to reside in a database. Therefore, approach B can significantly improve the performance of your spatial analysis, allowing you to gain insight quicker.
Hope you find this blog helpful and have seen how easy it is to improve the performance of your spatial analysis in Alteryx. Please leave a comment or reach out to me on Twitter @JosephSerpis or on the Alteryx Community and LinkedIn.
Building trust in data in any organisation is always a challenge. When you are working in a fast-paced environment and the next change in data is happening before the last one has been signed off, you need a sure-fire way of keeping on top of your data validity.
Here is why Alteryx, Alteryx Server and an Alteryx ACE are super perfect for this task.
When you have access to your base code and KPI definitions, you need to compare them against the reporting layer. In all organisations data can be messy: it can go through a number of data points and transformations, and when it lands in its final destination it can be very different from the starting point. Therefore data validation is key to ensuring that data is monitored and kept fresh for your end users.
This does of course mean you are taking ownership of the data’s accuracy, because without ownership you can never raise the profile, gain the access you need and develop the relationships within the organisation to fix any data issues you may uncover. So make sure you are in it for the right reasons.
Validation through data points has a particular design pattern in Alteryx that you can follow; it's a standard one, and you may even recognise it from some of your workflows already.
Firstly, you need to understand your KPIs and the definitions of these KPIs before you jump in, e.g. Sales = Gross_Sales - Returns, so that when you are calculating them at each data point you have the correct detail.
Secondly you need to know the architecture of your data flow within the business and have access to these systems so that you know how many data points you need to test.
Thirdly, do not try to go too granular with this; keep it high level, i.e. monthly or total figures across each data point.
Lastly, you need access to the data points, Alteryx will do all the hard work but make sure you have permissions and your mission is understood for the benefits to the business. (such as ensuring the data at the reporting layer is accurate)
As you can see, we first look up the current period, as this runs daily. Depending on your business you could use a DateTime tool here, but sometimes that is not possible; for example, you may run marketing-campaign style, with a business period of two weeks or two months, so the initial input depends entirely on your business. The formula tools are only there if you need to run the validation for a specific business period; it's a quick way of not having to change your workflow while having the tools in place to do so if you need to.
Then, with the Dynamic Input, we grab the KPI definitions at each data point. This can be multiple places within the organisation; you can see here I have two, but the reality is that it's more likely to be three or four in any business.
Before I go onto transpose I just wanted to show you an example of the data before and afterwards so you can visualise what is happening here.
Transposing the data is important so that you can have as many KPIs as you need as rows, ensuring that the business period is your KEY, KPI Name is your NAME and the KPIs are your VALUES. As we add data points we can then use a Join tool to make them as easy as possible to compare across systems; as we all know, it's easiest to compare them left to right.
Now an important factor for your join is that you need to join on the business period value and the Name of your KPI, this is why it is imperative in your SQL Queries within the dynamic inputs you call your KPIs the same name, even if their names within that system may be different. Personally I always use the reporting name as the KPI name and the Source data system as Data Point 1. Why the reporting name and not carry out the source name through to the end? Well this is purely down to the business users / end users at the end of the day. If you are to be visible about this information, which I encourage, they will understand the reporting name of SALES, PROFIT versus BI34, FIELD0192, for example.
Within your join, rename your VALUE fields from your two data streams to the name of your Data Points and ensure that your source system is always the first, then the next data point is your second, etc. Once this is complete it is to the Formula tool, where a few columns need setting up so that your output looks like the following:
Within your formula tool set up the following:
Difference – this can be absolute or just pure difference (Data Point 2 - Data Point 1)
% Error – 1 - (Data Point 2 / Data Point 1)
Validation – if Difference = 0 then "VALID" else "INVALID" endif
Validation Engine – "Alteryx"
Date – DateTimeNow() – set as a Date datatype
Time – DateTimeNow() – set as a Time datatype
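Put together, the formula-tool columns above amount to this calculation; a Python sketch with illustrative names:

```python
from datetime import datetime

def validation_row(data_point_1, data_point_2):
    # Mirrors the formula-tool columns described above.
    difference = data_point_2 - data_point_1
    pct_error = 1 - (data_point_2 / data_point_1) if data_point_1 else None
    now = datetime.now()
    return {
        "Difference": difference,
        "% Error": pct_error,
        "Validation": "VALID" if difference == 0 else "INVALID",
        "Validation Engine": "Alteryx",
        "Date": now.date().isoformat(),
        "Time": now.time().isoformat(timespec="seconds"),
    }
```
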
Now all that is left to do is set it up to run once a day on your Alteryx Server, save the results to your reporting database, and overlay a visual on it for your end users and yourself. You can also extend this by sending an email to your data team should any of these results be "INVALID", so that they can act upon it immediately. We have an amazing dashboard that shows us source as 100% and the reporting layer at 99.5%, so that we can see the timing differences and the potential gap in the data whenever we need; it's perfect for validating what we need before we embark on a data project.
So today we have learnt how to quickly set up validation between the data points in your organisation, using Alteryx and Alteryx Server, to build TRUST in the data within your reporting layer. Just remember it is only quick if you have access to the data points in the first place. Enjoy owning your data.
The next post in the series is going to help you think dynamically within Alteryx and when or why you would change an input tool to a dynamic input tool.
Welcome to part two of the automation series, here I will share why it is useful to share details of the automation tasks you are running.
Firstly, this is for those organisations looking to set up Alteryx Server and make the most of automation; before you get carried away, a number of small tips are useful to ensure a successful setup.
You need to design a repeatable process, a macro if you will, that you can add to the end of every workflow you run. What is this macro for, you ask? Well, it's going to do a number of things:
Record the number of rows
Record the name of the process
Record who ran the process
Record the date and the time the process ran
Record the connection detail
Record personal details used within the process
Within an engineering team its best to build a macro to deliver this process so that it is as simple as dropping the macro onto the canvas at the end of each developed workflow before uploading it to Alteryx Server.
So, in order to achieve this you need to utilise a number of tools within Alteryx, mainly the Interface tools, and save the result as a Standard Macro (.yxmc).
Useful tips for the areas mentioned above:
Number of Rows – use a Count Rows tool – this immediately takes the data fed into the workflow and changes it to a single result of total number of rows.
Name of process – use a formula tool with the engine constant of [Engine.WorkflowFileName]
Who ran the process – within the same formula tool, add a further column with GetEnvironmentVariable("USERNAME"); this uses the standard system environment variables to grab the currently logged-in user.
Record Date & Time – formula tool again, this time DATETIMENOW() is your friend with the correct datatype selected, i.e. Date for Date and Time for Time 😉
Connection details – using an Interface tool, design all the possible connections into a drop-down to be selected by the end user, ensuring the correct terminology is used.
Personal details – the final one; here I used a combination of tricks, selecting from a list of given options, which then creates a list and replaces the information in the Text Input through an Action tool. (Hint below)
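As a sketch of the record the macro assembles per run (column names are illustrative; in Alteryx each value comes from the tool or function listed above):

```python
import getpass
from datetime import datetime

def audit_record(row_count, workflow_name, connection, personal_fields):
    # One row written to the audit table after a successful run.
    now = datetime.now()
    return {
        "Rows": row_count,                    # Count Rows tool
        "Process": workflow_name,             # [Engine.WorkflowFileName]
        "RunBy": getpass.getuser(),           # GetEnvironmentVariable("USERNAME")
        "RunDate": now.date().isoformat(),    # DateTimeNow() as Date
        "RunTime": now.time().isoformat(timespec="seconds"),  # and as Time
        "Connection": connection,             # drop-down selection
        "PersonalDetails": personal_fields,   # Action-tool replacement
    }
```
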
In your main workflow you need to add a Block Until Done tool before the final Output tool. Off the first output of the Block Until Done, add your Output tool; then, from the second connection, add your new macro. Why, you ask? Because it will then only write to your database table once your workflow output has been successful. This gives visibility to the end users who need to know whether their data has run effectively this morning before they carry on with their everyday tasks.
There are other useful tips at your disposal here. You have the Events tab within Alteryx workflows: should your workflow fail at any point, you can configure this to email you so that you will always be aware that this particular workflow needs your attention. Also, Alteryx Server has a fabulous Tableau workbook that you can download and use as a way to dig deeper, on the visual side, into what has run, what hasn't run, etc. This is all fed from the underlying data in MongoDB on Alteryx Server, but it does need refreshing frequently to see the results, so you need to schedule it.
Overall why is this tip important? To enable visibility to your end users of all kinds, via SQL, via Tableau or via an export to email.
Ensuring all of your scheduled jobs run this process enables your business users to know, see and trust that the data is up to date and ready for the day ahead. This builds trust within the organisation. It also allows you to record which applications are using personal information, how often they are run, and how much data was extracted at any point and by whom. You have a great log within the database on Alteryx Server too; this is just a way you can control and record the details you would like your end users to have access to, and satisfy your legal departments.
The next post in my series will be on validation of your data and why it's important to set up a regular scheduled job so you have this detail in your back pocket. 🙂
Welcome to the Alteryx Automation mini series on Intersections and Overlaps. Starting fresh with Alteryx Server and Designer within a new organisation reminded me of all the small useful things which I had set up that I took for granted. So this mini series is for me to put pen to paper and share those useful tips with you all.
This first one is about having a lookup table in your database of the current business period and the previous one, and why that is useful when data engineering with Alteryx and Alteryx Server. There are many ways to do this and I have kept this example simple to follow along, but for some organisations dates and times don't follow monthly patterns and cannot simply use the MONTH() or YEAR() functions; they have to build it on 4-weekly patterns, or even 3-weekly ones.
When running automation, sometimes it's easier if you can delete the data of the current period and replace it with the fresh data from this period. Well, you are in luck, as this blog will share all.
Ideally you need a lookup table in your database which looks like this:
A useful lookup table
There are many ways to achieve this; here is an example where we look it up from the business data itself. This depends entirely on your business period. If your business does follow a simple monthly pattern then you can use a formula tool to create it, and it only needs to be run once a month, on the first of the month. However, if you are running a complex period and require regular change, run the process once a day to keep it fresh and create those rules. Ideally, you are trying to achieve the table above in a location you can access with Alteryx and Alteryx Server.
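For the simple monthly pattern, deriving the two period codes can be sketched like this (YYYYMM codes, as used later in this post; the function name is illustrative):

```python
from datetime import date

def period_lookup(today=None):
    # Builds the lookup values for a simple monthly business period.
    today = today or date.today()
    current = today.year * 100 + today.month
    if today.month == 1:                       # January rolls back a year
        previous = (today.year - 1) * 100 + 12
    else:
        previous = today.year * 100 + (today.month - 1)
    return {"CURRENT_MONTH": current, "PREVIOUS_MONTH": previous}

print(period_lookup(date(2020, 8, 15)))
# → {'CURRENT_MONTH': 202008, 'PREVIOUS_MONTH': 202007}
```
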
Internally, you can include in this lookup table all the different ways each system records your data, to ensure they are available for each one you interact with. This becomes powerful in the next part, when you are updating your data. For example, one system may record the business period differently from another; if so, include both and label them with their system names, so that when you use them in the next part it is simple to select the correct one.
Now we have our lookup table handy, we can put it to some use:
Within the Output tool configuration, let's look at the three elements which are important to writing data into a database where you want to delete the previous data and reload it with the current view. I will explain the purple highlighted areas on the left in the image above:
Connection: ensure you are connected to the database and selected the table you are updating the data in.
Output Options: Select here the action you would like to take place with your data after your PRE SQL statement. Here I have selected APPEND EXISTING because I want the data to be added to the table in the database, not drop the table, etc.
Pre Create SQL Statement: This is where it all comes together, PreSQL statements are statements you can run before any actions take place with the data on your database. Think about the action you want to happen when you add more data to your table. Do you want to add more data to data which already exists? No, you need to delete the current data first… so let’s take a look at how we might do that.
Delete from "TABLE" where YYYYMM in (select YYYYMM from "LOOKUP_TABLE" where DESCRIPTION = 'CURRENT_MONTH')
A Simple PRE SQL STATEMENT in Alteryx
The LOOKUP_TABLE in the DATABASE
Now looking at these two elements side by side, we have a lookup table that stays up to date with the current month and previous month codes and a PreSQL statement which enables us to grab the current_month code and remove the data for the current month before inserting the latest view of that month. Let’s also note that I have simplified this for the blog and not used internal codes for sales periods which are more complex than that of a simple YYYYMM format. Therefore having this lookup table enables you to quickly grab the detail you need without having to write the same code in multiple places. If you did need to update this code now, you only have to go to the one workflow to update rather than multiple workflows. I know right, always thinking of my future self!
If you wanted to create a second workflow, or run it once to replace the previous month's data, you only have to change the description in the PreSQL from 'CURRENT_MONTH' to 'PREVIOUS_MONTH'. And if you needed to replace both the previous and current month's data each time, you only need to change the comparison to greater than or equal to the previous month's code; that is possible too!
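To avoid maintaining two near-identical statements, the PreSQL can be generated from the description value; a sketch using the table names from the example above:

```python
def pre_sql(table, lookup_table, description="CURRENT_MONTH"):
    # Builds the delete statement used as the Pre Create SQL Statement.
    # Pass 'PREVIOUS_MONTH' to clear last month's data instead.
    return (
        f'DELETE FROM "{table}" WHERE YYYYMM IN '
        f'(SELECT YYYYMM FROM "{lookup_table}" '
        f"WHERE DESCRIPTION = '{description}')"
    )

print(pre_sql("SALES", "LOOKUP_TABLE"))
```
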
There are many ways to use and benefit from this but for me this was one of the first things I implemented as it makes sense to have control over the data I was pushing into the database through my automation workflows using Alteryx Server.
The next post in my automation series will show why writing a record into a table every time you run some automation in your database for your end users is the right thing to do.
Alteryx uses the ISO format to represent dates; any other formats are read as a string. This link provides a list of all the specifiers you need when doing date conversions, which is handy as they are case-sensitive. It also provides a list of the date-time functions you can use on your data, such as DateTimeTrim, which is very handy as it has the option of lastofmonth.
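As a point of comparison, the lastofmonth trim can be reproduced in Python; a sketch, not the Alteryx implementation:

```python
import calendar
from datetime import date

def last_of_month(d):
    # Equivalent of Alteryx's DateTimeTrim(dt, "lastofmonth"):
    # snap a date to the final day of its month.
    last_day = calendar.monthrange(d.year, d.month)[1]
    return date(d.year, d.month, last_day)

print(last_of_month(date(2020, 2, 10)))  # → 2020-02-29 (leap year)
```
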
Do you need a way to document your workflow and share how it works with others who don’t have Alteryx? Then this tool developed by Keyrus “Shameless company plug” and shared on the community is for you. I’ve used it to on-board new users and for compliance and audit purposes and for work I delivered as a Consultant at Keyrus.
As I help organisations to adopt Alteryx, the question of how often they should update typically comes up. The version support policy is typically where I point them, as each version of the Alteryx product has technical support for a period of 18 months from its release date. It also provides details of which version of R is used in Alteryx, and of MongoDB for those who have Alteryx Server.
When the latest version of Alteryx is out during valentines and the girls are happily playing… why not upgrade your setup?
Supplied with my latest ACE License Key (Thanks Tuvy) I set to work. Personal machine with full god powers. AG you crack me up!
Firstly, logging in at 15:49 to download the latest version; this completed at 15:53, a total of 4 minutes over a pretty decent internet connection. Today I think I am going to ensure predictive tools, R and Python are all installed and working without any conflicts, as it's been a while since I implemented this setup and personally I want to see if I still have the touch! Plus, this is really handy for those new to data science wanting to know what to do.
Step 1: Download Alteryx
Personally, I found a small delay between opening the install file and the unpacking starting, which led to my impatient fingers clicking more than once. Be aware, oh impatient ones! For those with Alteryx already installed, this installer will remove your previous version for you.
Because some places need click by click documentation I am going to include all the images required for the whole installation. I know this is tedious, but if this saves someone having to get hold of a test machine to unpack and screenshot everything, consider this a gift.
Step 2: Download Predictive Tools for Alteryx with R 3.6.3
Useful notes from the installation:
Python version is 3.8.5 – installation folder: C:\Program Files\Alteryx\bin\Miniconda3
R version is 3.6.3 – installation folder: C:\Program Files\Alteryx\R-3.6.3
Always worth reading the release notes: Server 2021.1 Release Notes | Alteryx Help
Step 3: Configure RStudio
I already have RStudio installed on my machine, so I only had to update the folder in the global settings to be able to run my R code again. If you don't have it, you will need to install it; however, don't install R itself, as you already have it installed under the Alteryx folder structure, and you can point RStudio to it once installed. (Tools >> Global Options brings up the following window.)
Once you have done this, you do need to restart RStudio for the new location to take effect. Before you rush off, it's best to try some R code. Here is a bit of code I like to use:
##GRAB SOME DATA WITH R
x <- data.frame(cars)
##PLOT A GRAPH
#AlteryxGraph(2, width=576, height=576)
plot(x$speed, x$dist)
abline(lm(dist ~ speed, data = x))
This code is written to be both RStudio- and Alteryx-friendly: the comments can be uncommented for Alteryx, but the code itself is ready for RStudio. It's always important to test that both your environments are working and that you get the same results from both, so this simple test should help you on your way.
Perfect setup for a day playing with Alteryx.
New Features which I have noticed in 2021.1 of Alteryx
The amazing double-click to add a tool to the canvas. This is an absolute time-saving gem when you know the flow you are creating and just want to get on with it. These time-saving features are perfect; awesome work guys.
I personally love the F2 shortcut: when you are on a tool, this takes you straight to the Annotation section with the text highlighted, ready for updating. Another perfect way to support users designing their workflows with notes from the start, at speed. I have seen there are a few more shortcuts now; worth updating yourselves for your CORE certifications. Shortcuts are here.
The Download tool now has the Throttle tool built in. I look forward to testing this one out on my web-scraping workflows.
New features which I have noticed in 2021.1 of Alteryx but were released in a prior version
I upgraded from 2020.3, so some of the features I have noticed were actually released in between. In the Results window, Actions is now a consolidated menu in which the Save, Copy & Paste and New Window options sit. This does make the area look smarter; however, I think it adds an extra click for those who love to throw the data straight out to Excel. (Yes, users still do this.) I would love to see a programmable button here where you can have an output already configured and just click it, making it super speedy to see that data.
I hope this helps you all get your setup working much faster when dealing with IT internally. Tried and tested right here for you. Happy Valentine's 2021. *o*
There are so many options when it comes to geocoding in Alteryx. You can build a workflow that uses Google APIs to look up your postcode and feed back the answer for you, and there are lots of blogs around relating to that. But building your own geocoder using open-source data, one that doesn't require an internet connection every time you would like to use it, now that is a macro for your suite of tools. Buckle up my friends, let's go.
Firstly you need some data. Of course, this is being written in the UK, so I am going to use UK postcodes. The Office for National Statistics is an epic treasure trove of data; here is a link to the Postcode Directory for August 2020. This dataset is updated every three months, so you will need to keep that in mind if you want to keep on top of newer postcodes, but the process we are building is going to support that.
Before we carry on let me explain the process we are going to undertake. We are going to create three small workflows. One is to build a Calgary database (1) which we will then use within the Macro (2) we are going to build and finally we will use the Macro in a workflow (3).
Step 1: Building a Calgary Database
The first step is to connect to the CSV dataset, which can be found in the “Data” folder of the ONS download (Downloads\ONSPD_AUG_2020_UK\Data).
Next, create a new field which removes the spaces from pcd (the postcode column in the ONS data). Users tend to type postcodes inconsistently and often get them wrong, so a space-stripped field that supports more matches when joined is the best way to deal with matching on postcodes.
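As a rough sketch of that normalisation logic (the Python below is just an illustration; in the workflow itself this lives in a Formula tool):

```python
def stripped_pc(pcd: str) -> str:
    # Remove every space so "SW1A 1AA", "SW1A1AA" and "SW 1A 1AA"
    # all reduce to the same key for an exact-value match later on.
    return pcd.replace(" ", "")
```

In the Formula tool the equivalent expression would be something like Replace([pcd], " ", "").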
Then we select the fields from the ONS dataset that we want to use in our database. It is always useful to have the lookups available within your geocoding tool, so I have included the Output Area, Lower Super Output Area, etc. However, the latitude and longitude fields are a must for this tool, as is the new field we created to help us match against human-inputted data (strippedpc).
We then add the Calgary Loader tool, which will index and store all our required data in a compact, quickly accessible database. Make sure to save this somewhere accessible to all.
Step 2: Geocoder Macro
Start with a Macro Input containing just a single column and row with an example postcode in it. You can use your own, or perhaps your office building's postcode. I called this column postcode, and set up the question so that the user understands what is required from them in order for the macro to work.
Then, as this is the user side, we need to remove spaces from any postcode they enter; as we have already mentioned, it is likely they haven't entered it as the standard requires. The best way to get a match is to remove all spaces and match against the strippedpc column you made in your Calgary database.
Now for the best step: adding in the database. This isn't your normal Input tool; you actually need a Calgary Join tool. Connect your user-entered data to the input anchor of this tool, then set it up as follows:
Here you are joining the data from your user, which could be a list of 1 or a list of 1 million records, to the Calgary database you have created. In the configuration options, select your user-created strippedpc field against the strippedpc index field, with the query type Exact value. This provides a quick way of finding your matches.
Then we join back onto our original data, using strippedpc to make the join.
From our matched data we want to take the latitude and longitude values and create a point from them, so that we are officially geocoding.
Then we have two Macro Outputs: one for the matched data from the join, and one for the unmatched, so that the user has full control of all the data they entered at the start of the process and knows which postcodes are “not valid”.
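To make the flow of the macro concrete, here is a rough Python stand-in for the matching logic: the Calgary database becomes a lookup keyed on strippedpc, and each user record is either matched (picking up latitude and longitude) or passed to the unmatched output. The field names are assumptions based on the database built in Step 1; this is an illustration of the logic, not the macro itself.

```python
def geocode(records, index):
    # index:   strippedpc -> {"lat": ..., "long": ...} (plus any lookup fields)
    # records: user-entered rows, each with a "postcode" field
    matched, unmatched = [], []
    for rec in records:
        key = rec["postcode"].replace(" ", "")   # same stripping as the database
        hit = index.get(key)                     # exact-value match
        if hit is not None:
            matched.append({**rec, **hit})       # carry lat/long through
        else:
            unmatched.append(rec)                # the "not valid" postcodes
    return matched, unmatched
```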
Before you save your macro, don't forget to give it a helpful name, some metadata (a description) and an icon, so that users can pick it off the toolbar knowing exactly how to use it.
Step 3: Testing your Macro
Finally, put a postcode in a Text Input tool, attach your macro and add a couple of Browse tools. Run to test and view your geocoded results; as you can see, my test postcode was Buckingham Palace.
Job done: we have just created a geocoder, and one that you can keep up to date too.
Testing 1 record takes 0.5 seconds, 26,472 records take 1.2 seconds, and 2,467,046 records take 30.9 seconds, which is a great demonstration of speed. Thanks for following. *o*
In Part 1, we introduced the Location Optimiser macro and how the p-median based approach for solving location-allocation problems is shown in the Alteryx sample workflows (trying to minimise the overall weighted distance that demand points have to travel to a set of potential facilities). In this part, we’ll give some examples to bring it to life a bit, and discuss some of the different approaches to simply minimising distance.
The approach needed will depend on how the location optimisation problem is framed. The minimise-impedance problem was discussed in part 1 of the blog in the form of the p-median problem; this approach is designed to minimise the sum of weighted costs between demand and facilities. Another approach is to maximise coverage in some way, capturing the most demand from a fixed number of facilities within a set distance or time. The result is that the most demand (people, patients, fires, crimes etc.) within the distance cut-off will be selected (a set cut-off means that some demand may be excluded if there is insufficient demand near a facility). In both of these, you need to know how many facilities you want to add. In cases where you don't know this, the problem can be framed as minimising the number of facilities: we are trying to maximise the coverage whilst minimising the facilities. For a bit more reading on these, see the useful overview in the ArcGIS Network Analyst help.
In practice, we may be asked questions that require us to re-frame them into one of these location optimisation approaches. Let’s imagine that we were launching a new brand into the UK. We might want to see “how many locations would we need for 80% of the population to live within 30 miles of a store?”. This is really a minimise-facilities problem, but with an acknowledgement that 20% of the population are going to live further away than 30 miles. Whilst we are trying to minimise the facilities, we are actually trying to maximise the population counts at each location by choosing those that are most densely populated.
Of course, we want to use the location optimiser macro (which is what the blog is all about after all!), but you’ll remember that one of the inputs for this is the number of facilities we want to add to the network. This is exactly what we want to find! In actual fact, we can use a combination of an iterative macro to estimate the number of facilities, and then a location optimiser macro to find the best (the most population) combination of n facilities. This saves us running multiple configurations of the location optimiser which takes some time, whilst also making sure that we get the best overall combination of facilities.
First, we need some demand points and some supply points. The former is just small-area population counts (for example Output Areas or LSOAs from the census, displayed as centroids). For the latter, we need to create some potential facilities, as we don't actually have any locations in mind. We can do this with a 10km grid around the UK coastline, and then extract the centroids (with the Spatial Info or Formula tool).
Estimating the number of facilities with an iterative macro
We want to use this macro to, run by run, select the facility with the highest population within 30 miles, calculate the running percentage of the overall population, and then see how many facilities to run the location optimiser for. The challenge is that in Alteryx, iterative macros which have to run spatial processes can be (painfully) slow. However, we can design our macro to run entirely using joins and numeric calculations, doing any spatial work outside the iteration itself.
First, we create a large long-thin matrix table containing the supply and demand ID pairs (one demand point can be in many supply point’s 30-mile radius), the demand for each demand point, and the total demand for the supply point. In our example, there are almost 14 million pairs.
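As a sketch of what that long-thin pair table contains, the example below enumerates every supply-demand pair within 30 miles using a great-circle distance check. The real workflow does this spatial work with Alteryx tools before the macro runs; the IDs and coordinates here are purely illustrative.

```python
import math

def miles_between(a, b):
    # Haversine distance in miles between two (lat, lon) points.
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 3958.8 * math.asin(math.sqrt(h))

def build_pairs(supply, demand, radius=30):
    # supply / demand: dicts of id -> (lat, lon).
    # Returns every (supply_id, demand_id) pair within the radius.
    return [(s, d)
            for s, s_pt in supply.items()
            for d, d_pt in demand.items()
            if miles_between(s_pt, d_pt) <= radius]
```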
On the first run, we want to choose the supply point that has the highest total demand and use this as our first facility. Here, 12.5 million people live within 30 miles of the supply point, spread across almost 39k demand points.
Next, we need to burn down, or remove, any of these demand points from the looping data. This is so that once a demand point is allocated, it can't be used again. We do this by joining all of the supply-demand pairs to the demand points that are used, and only taking forward the unmatched pairs.
If we were to map the remaining demand points, we would see a large hole around London, the most densely populated part of the UK:
The final thing is to re-create the input table of supply-demand pairs and re-calculate each supply point's total demand for the next iteration. In doing so, the ranking of the most densely populated supply points changes: many of the supply points that had a high proportion of their demand removed after the first iteration are now among the least densely populated and are no longer viable as a facility. Accordingly, in iteration 2 the supply point at the next most densely populated part of the UK is chosen, and so on. Here is the macro in its entirety:
If we calculate the running percentage of population coverage for each iteration, this gives us a table similar to before, and we can map the facilities needed to reach coverage of 80% of the population within 30 miles. In this case, it's 15 facilities.
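Put together, the iterative "burn down" approach boils down to a greedy loop, sketched here in plain Python. This is an illustration of the logic rather than the Alteryx macro itself; the pair table and population lookup are assumed inputs.

```python
def greedy_coverage(pairs, demand_pop, target_share):
    """Repeatedly pick the supply point covering the most remaining demand,
    burn down that demand, and stop once running coverage reaches the target.

    pairs:        list of (supply_id, demand_id) within the 30-mile radius
    demand_pop:   dict of demand_id -> population
    target_share: e.g. 0.8 for 80% of the total population
    """
    total = sum(demand_pop.values())
    remaining = dict(demand_pop)
    chosen, covered = [], 0
    while covered / total < target_share and remaining:
        # Total remaining demand within each supply point's radius.
        scores = {}
        for s, d in pairs:
            if d in remaining:
                scores[s] = scores.get(s, 0) + remaining[d]
        if not scores:
            break  # no supply point covers any remaining demand
        best = max(scores, key=scores.get)
        chosen.append(best)
        covered += scores[best]
        # Burn down: demand allocated to a facility can't be used again.
        for s, d in pairs:
            if s == best:
                remaining.pop(d, None)
    return chosen, covered / total
```

Note that, exactly as described above, this always takes the densest remaining supply point in a single fixed order, which is why it can miss a better-scoring combination with less overlap.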
This all runs in about 8 minutes, much quicker than the compute heavy location optimiser would be for assessing multiple facilities. Here are the final locations selected:
This method is really a brute-force approach that prioritises cumulative population density above all else, in a single fixed order. In fact, a better solution (a higher population overall) could be reached by a different combination of facilities with less overlap, which the iterative approach does not allow. So…
The location optimiser
With the output from the iterative macro – 15 – we set this as the number of facilities to find. Inside the location optimiser macro, we use the Spatial Match tool to find which demand points are within the 30-mile trade area. As the macro randomly chooses which 15 facilities to be part of the solution, we may have situations where a demand point is within 30 miles of more than one facility in the solution. We can remove these duplicated demand points with the Unique tool (it doesn't really matter which facility they are assigned to in this case, since we are assessing the random selection of 15 facilities as a whole).
This assessment or scoring is done simply by summing up the population within the solution, and we want to change our macro settings to maximise the population who live within 30 miles of the set of facilities (a higher score).
Then we can calculate the percentage of the population that the best 15 facilities cover within 30 miles. The table below compares this to the iterative approach and, as you can see, whilst it's pretty marginal, the location optimiser does find a better overall solution.
(Table: UK total population and the % of population covered by each approach.)
As might be expected, the result is basically a map of the major cities and their surroundings, or wherever the highest combined population density is. This took about 15 minutes to run. Incidentally, a solution with 14 facilities almost reaches the cut-off, at 79.1% (again, marginally higher than the equivalent using just an iterative macro).
You’ll notice that the two approaches actually lead to quite similar results, both in terms of where they find locations and the number needed. However, the location optimiser finds a slightly better solution (81.1% vs 80.9%) using the same number of facilities. This is because the location optimiser finds the best overall solution, whilst the iterative approach is cumulative and has only one possible outcome – there is no flexibility in the combination of facilities it will choose. For minimise-facilities problems, a combination of the two may work well to reduce computation time: first run an iterative approach to reach the likely number of facilities, then run a location optimiser macro.
Obviously in reality, constraints of existing locations, capacity, closeness to road links and costs (and many more) play a much more significant role than this simple example, and using vehicle journey times would likely yield a different result as well. Nevertheless, this hopefully gives an idea of how location optimisation is a really powerful technique that can be used to solve many typical location-allocation problems in a few different ways.