A Forensic Approach to Understanding Website Architecture



Updated: June 5th, 2022
Originally Written: January 18th, 2019

When I approach a website design project, I start by looking at the foundation, the assets, the content, and the overall structure of the existing site. Doing this gives me a full picture of all the content available to a user. Clients typically build sites over time (this is normal). As that happens, more content gets added and a site can quickly bloat into an unmanageable mess.

Understanding the architecture of your website can be a challenging task; there are a lot of moving parts on really large websites. Running through this exercise, however, gives you insight into how any website is structured. I tend to find lots of content hidden in the basements of deeply linked pages.

The following information will walk you through my approach to analyzing and stripping down an active website, running crawls, and filtering data. I’ll also share how you can apply some of this knowledge to your advantage!

Step 1 – The Crawl

Grab the URL you want to review and toss it into an app like Screaming Frog. You’ll need a license to crawl a site larger than 500 URLs, so I’m using a smaller site for this example. Pasting the site’s link and hitting “Start” will begin the crawl. There are lots and lots of amazing features, preferences, and more inside Screaming Frog, but this post isn’t about those goodies, for now 🙂 … you can use any tool, really. Just make sure it’s a reliable crawler so you can grab all of the page links.
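If you’d rather script the crawl yourself (or just sanity-check a crawler’s output), here’s a minimal sketch of a same-domain link collector in Python. The starting URL is a placeholder, and it assumes the requests and beautifulsoup4 packages are installed; it only follows <a href> links, so treat it as a rough stand-in for a dedicated crawler, not a replacement.

```python
# Minimal same-domain crawler sketch. The starting URL is hypothetical.
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

START_URL = "https://www.example.com/"   # placeholder site
DOMAIN = urlparse(START_URL).netloc

seen = {START_URL}
queue = deque([START_URL])
pages = []   # (url, status code) pairs, similar to a crawler's HTML export

while queue:
    url = queue.popleft()
    try:
        response = requests.get(url, timeout=10)
    except requests.RequestException:
        continue
    pages.append((url, response.status_code))

    # Only parse HTML responses for further links.
    if "text/html" not in response.headers.get("Content-Type", ""):
        continue

    soup = BeautifulSoup(response.text, "html.parser")
    for anchor in soup.find_all("a", href=True):
        link = urljoin(url, anchor["href"]).split("#")[0]
        if urlparse(link).netloc == DOMAIN and link not in seen:
            seen.add(link)
            queue.append(link)

print(f"Crawled {len(pages)} pages on {DOMAIN}")
```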

The crawl should look something like this:

[Screenshot: Screaming Frog site crawl]

Step 2 – The Export

Export the HTML data from Screaming Frog. Hit the drop-down and select HTML, then export the data to a CSV.

I like to name my files something like Jane Does Plumbing Architecture.csv.

[Screenshot: CPP Screaming Frog file]

Upload your file to Google Drive (you can follow along if you’re using Excel).
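If you’d rather poke at the export outside of a spreadsheet entirely, here’s a quick sketch using pandas (an assumption on my part; the post itself sticks to Sheets) to confirm the file loads and see which columns you’re working with. The file name matches the example above.

```python
# Quick look at the exported crawl before any cleanup (assumes pandas is installed).
import pandas as pd

crawl = pd.read_csv("Jane Does Plumbing Architecture.csv")
print(crawl.shape)              # number of rows (pages) and columns
print(crawl.columns.tolist())   # check the headers before deleting anything
```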

 

[Screenshot: CPP Google Sheet]

Step 3 – The Initial Grind

IMPORTANT – Open your recently uploaded file and DUPLICATE the tab/sheet first. Rename the original tab as “RAW”.

Name the duplicate sheet something like “CLEAN”.

…and now comes the fun part. When you have a reliable crawl uploaded, you can start filtering the data down into chunks that your project team can understand.

Prepare the Destruction

There are several key sub-steps in this process. They are as follows:

  • Delete columns that you don’t find critical in a typical crawl, like content type, inlinks, outlinks, ratios, and that kind of stuff. Keep everything you generally use, like titles, descriptions, and status codes (404, 302, etc.).
  • Set the primary table’s column header row as filters for the sheet. This allows you to keep, but hide, data like broken or missing pages.
  • Filter out 302s and 404s. (If you’d rather do this cleanup in code, see the sketch after this list.)
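For anyone working in code instead of Sheets, the same cleanup looks roughly like this. The column names (“Address”, “Status Code”, “Title 1”, “Meta Description 1”) follow a typical Screaming Frog export but may differ in yours, and the file names carry over from the earlier examples.

```python
# "Prepare the destruction" in pandas: keep only the useful columns and
# drop anything that isn't a 200 response. Column names may vary by crawler.
import pandas as pd

raw = pd.read_csv("Jane Does Plumbing Architecture.csv")   # untouched "RAW" copy
keep = ["Address", "Status Code", "Title 1", "Meta Description 1"]
clean = raw[[col for col in keep if col in raw.columns]].copy()

# Filter out 302s, 404s, and every other non-OK status code.
clean = clean[clean["Status Code"] == 200]
clean.to_csv("Jane Does Plumbing Architecture - CLEAN.csv", index=False)
```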

[Screenshot: CPP site architecture sheet]

Step 4 – Categorize the Findings

Insert a new column into the sheet and call it “Category”. This is going to play a major role in breaking down the document.

If you’re still with me, then it’s time to categorize the content. The naming conventions you use are totally up to you. To get things going, I always sort the sheet by URL. Doing this groups similar pages together, making it easier and faster to apply category updates to big batches of pages.

The goal is to have this file serve as the foundation for a client-facing website sitemap that meets the needs of UX/UI designers and digital marketers.

The Categorizing Approach

This part is rather straightforward. Through various sorting methods, I categorize the pages in the URL column. In the end, I want a sheet that groups similar pages together via labels and color-coding. This sorting allows me to understand everything that’s on a live site.

For example, I’ll go through and look for dynamic blog URLs like “blog-page/2” and categorize them as dynamic. Content under the “About” section generally gets labeled “Main-About” in the Category column.
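If the sheet is large, a little pattern matching can handle a first categorization pass before you hand-sort the rest. Here’s a sketch, assuming the CLEAN file from earlier; the patterns and labels are examples, not a fixed taxonomy.

```python
# First-pass categorization by URL pattern; hand-sort whatever it misses.
import re

import pandas as pd

clean = pd.read_csv("Jane Does Plumbing Architecture - CLEAN.csv")

def categorize(url: str) -> str:
    if re.search(r"/blog-page/\d+", url):   # paginated archives like "blog-page/2"
        return "Blog-Dynamic"
    if "/blog/" in url:
        return "Blog-Post"
    if "/about" in url:
        return "Main-About"
    return "Uncategorized"

clean["Category"] = clean["Address"].apply(categorize)
print(clean["Category"].value_counts())
```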

When all pages are sorted, I’ll color code the various category groups to make things a little easier to scan.

[Screenshot: CPP site architecture sheet]

What All This Means

When you’re done labeling and coloring the heck out of your sheet, you’re ready to take the next step. Depending on your role, this could mean a number of things, from understanding the word count on specific pages to breaking down a future version of the sitemap through a combination of keyword research, new content plans, and existing content transfer plans.

This process helps you take a more detailed and forensic approach to website architecture analysis. In a later post, we’ll merge this data with other data sets to give us a powerful view of the performance of a website.

Here is a summary/checklist of actions from this post so far:

  • Crawl the site using something like Screaming Frog.
  • Export the data and upload it to Google Drive.
  • Duplicate the tab/sheet so you have the original data set.
  • Rename your tabs/sheets to “RAW” and “CLEAN”.
  • Make sure you’re now on the “CLEAN” tab/sheet.
  • Clear out columns you don’t want.
  • Filter the data using the first row of the sheet as your filter set.
  • Hide 404s, 302s… basically anything that isn’t a 200 “OK” response code.
  • Create a new “Category” column.
  • Start grouping up content using categories.
  • Don’t give up!
  • When everything is categorized, color code it!
  • Now you have a data set that you can analyze further and convert into a sitemap.

Step 5 – The Sitemap

While I won’t be able to get to everything today, the sitemap is what helps guide the rest of a website project. All of this might seem like overkill, but this upfront work will reduce the time and energy required for the back half of the project (development). Our goal is to plan as much as we can upfront, then let the machine, the system, and the people do the work uninterrupted. This leads to some of the best results and products.

We currently use Google Docs and/or Whimsical for website mapping. Whimsical is a great visual tool and, frankly, you should try to use something visual as often as possible because it just makes things so much easier for all parties involved.

Rather than explain “how to make a website sitemap,” I’ll just share with you some of the things I consider when I move data from a sheet to a visual website sitemap.

  • Remove things that look old and outdated.
  • Consolidate! This is the big one. A lot of times there are seven pages with only a few sentences each; why not make them a single page if you can? The mission statement, vision, and company history can all live on the same page.
  • Can specific sections turn into a single section?
  • Can you pare down the navigation to the point that there are only 3-5 main navigation items (your top navigation)?
  • Are there opportunities to build out more content?

These are the most critical items that come to mind for me when I’m putting together a website sitemap. Hopefully, you found this helpful and I’ll expand on this more in the future.
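If you want a head start on that move from sheet to sitemap, here’s one last small sketch that groups the categorized CLEAN file into a plain-text outline you can paste into Whimsical or a Google Doc. The file and column names carry over from the earlier examples and are assumptions, not a requirement of the process.

```python
# Turn the categorized crawl into a rough text outline, one group per category.
import pandas as pd

clean = pd.read_csv("Jane Does Plumbing Architecture - CLEAN.csv")

for category, pages in clean.groupby("Category"):
    print(category)
    for url in sorted(pages["Address"]):
        print(f"  - {url}")
```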

As always, thank you for reading and I’ll see you again soon!