LOD equivalent of INDEX and RANK (part 2)

In my last blog post I looked at a LOD equivalent of RANK / INDEX table calculations in Tableau. That approach was limited to ranking a very small range of whole numbers, and left me considering other options. This post outlines another LOD approach using spatial functions!

WARNING: As before I will stress that you should rarely need a LOD-based equivalent of rank or index, and can often use table calculations when you don’t think that you can. That said, there are scenarios where a LOD equivalent can be useful: onward use of the calc or use in spatial functions being the cases I’ve seen on the Tableau community forums. And it’s also a fun challenge!

To revisit why this is a tough (fun!) challenge we have to remember that LODs really only break out of the viz level of detail to a higher or lower level of detail. So if you’re working at a month level of detail, a LOD expression can get you the figure for that month’s quarter or year, or it can work with figures at the day level. However we would usually say that there is no LOD expression (or indeed level of detail) that equates to the rows before or after any given row. And that makes finding an index or rank difficult with LODs.

The previous solution I wrote about takes any value and switches on a 1 at that value’s position in a string of 1s and 0s. So 5 = 10000 and 3 = 00100. These strings of 1s and 0s can be considered as a binary number. And that can be summed up using a LOD. For 5 and 3 the sum gives us 10100 (in binary). And from this we can count all of the 1s including and to the right of the n-th digit. So the number of 1s including and to the right of the 3rd digit = 1, and the number including and to the right of the 5th digit = 2. Hence 3 = rank/index of 1, and 5 = rank/index of 2.

As noted above, this approach is constrained to quite small ranges of whole numbers. However it does give us a generalised approach to solving our problem with LODs. What we need is an alternative way to add up all the values in a range, such that we can still count how many there are “including and before” any particular value.

This is where spatial functions come in. We can wrap spatial functions in LODs, use the likes of COLLECT to combine spatial values, and use INTERSECTION to find overlaps.

Approach using LODs and spatial functions

Here’s how we can stitch this together for a LOD equivalent of RANK_DENSE (which is the same as INDEX when there are only unique values); a code sketch of the same geometry follows the list:

  • Convert the values (that we want to rank) to a latitude between 0 and 90
  • Convert these latitudes to buffered points (that have an area) along longitude 0
  • Use a LOD to COLLECT all these buffered points (circles) together
  • Also convert each value’s latitude to a similarly buffered line from that latitude back to 0
  • INTERSECT the buffered line with the LOD COLLECT of buffered points
  • This leaves us with just the circle equivalents of all the values up to and including this one!
  • We can’t count these (as far as I know!) but we can get the combined area
  • And we can divide that area by the known area of a single buffered point
  • Which gives us the equivalent of a count, or RANK_DENSE
  • And for ranges with only unique values, this will be the same as INDEX
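
If it helps to see the geometry outside Tableau, here’s a minimal Python sketch of the same steps using shapely (planar rather than geodesic buffering, and made-up values of my own, but the counting logic is the same):

import math
from shapely.geometry import LineString, Point
from shapely.ops import unary_union

values = [22, 41, 60, 95]                        # the measures to rank
lo, hi = min(values), max(values)
lats = {v: (v - lo) / (hi - lo) * 90 for v in values}
r = 1.0                                          # buffer radius

# COLLECT: one buffered point (circle) per value, merged into a single shape
circles = unary_union([Point(0, lat).buffer(r) for lat in lats.values()])

def rank_dense(v):
    # buffered line from latitude 0 up to this value (just a point at the minimum)
    probe = (LineString([(0, 0), (0, lats[v])]) if lats[v] else Point(0, 0)).buffer(r)
    # INTERSECTION + AREA: turn the intersected circles back into a count
    return 1 + math.floor(probe.intersection(circles).area / (math.pi * r * r))

for v in values:
    print(v, rank_dense(v))                      # 22→1, 41→2, 60→3, 95→4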

Illustration

To illustrate here is a table of sales by sub-category using sample superstore data, along with INDEX and RANK_DENSE using table calcs, and a LOD version.

[Image: index-1]

Art is ranked fourth here. If we convert the sales values to circles between latitudes 0 and 90, and also draw a line from Art back to latitude 0 (buffered to the same radius as the circles), then we can see that this line intersects four of the circles, equating to the rank/index of 4 for Art:

[Image: index-2]

Calculations to replicate RANK_DENSE

Let’s take a look at the calculations to replicate RANK_DENSE. For this example I’m using data with non-unique values:

[Image: rank-1]

I’m going to use INCLUDE and EXCLUDE LODs, but you can try with FIXED too. Just remember that if you use FIXED you’ll need to account for any other filters/fields on your view (either adding filters to context or including the dimensions that you need to fix by), otherwise the FIXED LOD will bust out beyond the data you want to consider – e.g. include years that you wanted to filter out.

First up some base calcs that will be needed regardless of which version of RANK that we’re trying to replicate:

We need to know the minimum and maximum values in the range, as these will end up equating to latitudes 0 and 90 – e.g. with a min of 1,000 and a max of 5,000, a value of 3,000 would map to latitude (3000 − 1000) / (5000 − 1000) × 90 = 45…

In these calculations Sub-Category is the dimension I want to rank and [Total Sales] is the measure I want to rank by. You will replace these with the dimension and measure you want to use from your data.

BASE (a) min

{EXCLUDE [Sub-Category]: MIN(
 {INCLUDE [Sub-Category]: SUM([Total Sales])}
 )}

BASE (b) max

{EXCLUDE [Sub-Category]: MAX(
 {INCLUDE [Sub-Category]: SUM([Total Sales])}
 )}

BASE (c) lat

( SUM([Total Sales]) - MIN([BASE (a) min]) )
/
( MIN([BASE (b) max]) - MIN([BASE (a) min]) )
*
90

Next up we convert the latitudes to a point, and we define a buffer radius to get a circle. Note that you may need to tweak this buffer radius if you find that your values, when converted to latitudes and buffered, overlap too much and give you incorrect results.

BASE (d) point

MAKEPOINT([BASE (c) lat],0)

BASE (e) buffer m

10

Now we’re into the RANK_DENSE specifics. We want a buffered circle, and a buffered line back to latitude 0, per value. One thing to note here is that some of these spatial functions count as aggregations, which means they won’t accept aggregated arguments – so we have to wrap any aggregations in another LOD first, hence the extra INCLUDE in the next calc.

RANKD (g) circle

COLLECT(
 {INCLUDE [Sub-Category]: 
 BUFFER([BASE (d) point],[BASE (e) buffer m],'m')
 })

RANKD (h) line

BUFFER(
 MAKELINE( MAKEPOINT(0,0), [BASE (d) point] ),
 [BASE (e) buffer m], 
 'm')

The penultimate step is to return the overall collection of circles that intersect with a specific value’s line. To do this we use the INTERSECTION spatial function. And we use an EXCLUDE LOD on the circles and an INCLUDE LOD on the line to ensure one is the overall set and the other is specific to the value.

RANKD (i) intersect

INTERSECTION(
 {INCLUDE [Sub-Category]:[RANKD (h) line]}, 
 {EXCLUDE [Sub-Category]:[RANKD (g) circle]}
 )

That leaves the final step: converting the circles we’re left with – those up to and including this specific value – back into a count. I don’t think we can count the circles with the spatial functions that Tableau has, but we can return the AREA of the collection of circles, and we can then divide that area by the known area of a single circle (“pi r squared”, where r is the buffer radius).

RANKD!

1 + FLOOR(ZN(
 AREA([RANKD (i) intersect],'m') 
 / 
 ( PI() * SQUARE([BASE (e) buffer m]) )
))

The use of FLOOR here caters for values that are close enough together that, when converted to latitudes, their buffered circles slightly overlap and shave a little off the total area – e.g. an intersect area of 3.97 unit circles still gives 1 + FLOOR(3.97) = 4. And this is the RANK_DENSE!

What about RANK_MODIFIED?

In the screenshot above I also included a LOD version of RANK_MODIFIED. The approach for this is a little more complex.

It involves a FIXED LOD to determine how many occurrences there are of any given value (in my case the SUM of [Total Sales] per Sub-Category):

BASE (f) num same

{FIXED 
 {FIXED [Sub-Category]:SUM([Total Sales])}:
 COUNTD([Sub-Category])
 }

Then we work out the radius of a circle whose area would be that number of times larger than the circle for a single occurrence of the value – e.g. for two occurrences the radius grows by √2, doubling the area:

RANKM (g) radius

// where r = normal radius
// what R (radius) do we need for a circle with N times the area
// pi*r*r*N = pi*R*R
// r*r*N = R*R
// R = SQRT(r*r*N)
SQRT( SQUARE([BASE (e) buffer m]) * [BASE (f) num same])

We then make our buffered points (circles) use this radius, and our buffered line use the max of these radiuses (radii?):

RANKM (h) circle

COLLECT(
 {INCLUDE [Sub-Category]: 
 BUFFER([BASE (d) point],AVG([RANKM (g) radius]),'m')
 })

RANKM (i) line – note the FIXED MAX for radius here

BUFFER(
 MAKELINE( MAKEPOINT(0,0), [BASE (d) point] ),
 MAX({FIXED: MAX([RANKM (g) radius])}), // line buffer max of circle buffers
 'm')

The intersection and final calc are then pretty much the same as the RANK_DENSE approach. And because the values with multiple occurrences have bigger circles, the final division by the area for a single occurrence gives us the RANK_MODIFIED.

RANKM (j) intersect

INTERSECTION(
 {INCLUDE [Sub-Category]:[RANKM (i) line]}, 
 {EXCLUDE [Sub-Category]:[RANKM (h) circle]}
 )

RANKM!

1 + FLOOR(ZN(
 AREA([RANKM (j) intersect],'m') 
 / 
 ( PI() * SQUARE([BASE (e) buffer m]) )
))
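
Extending the earlier Python/shapely sketch to the modified rank (again illustrative, with a duplicated value, and planar rather than geodesic buffering):

import math
from collections import Counter
from shapely.geometry import LineString, Point
from shapely.ops import unary_union

values = [10, 30, 30, 80]
counts = Counter(values)                                   # BASE (f) num same
lo, hi = min(values), max(values)
lats = {v: (v - lo) / (hi - lo) * 90 for v in counts}
r = 1.0
radii = {v: math.sqrt(r * r * n) for v, n in counts.items()}   # RANKM (g) radius

circles = unary_union([Point(0, lats[v]).buffer(radii[v]) for v in counts])
line_r = max(radii.values())                               # line buffer = max radius

def rank_modified(v):
    probe = (LineString([(0, 0), (0, lats[v])]) if lats[v] else Point(0, 0)).buffer(line_r)
    return 1 + math.floor(probe.intersection(circles).area / (math.pi * r * r))

for v in sorted(counts):
    print(v, rank_modified(v))                             # 10→1, 30→3, 80→4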

RANK_MODIFIED … LOD-ified!

What about INDEX?

If your values are all unique then either of the above approaches will give you the equivalent of INDEX too – pick the first / simpler one!

However if your values are more like my example, i.e. more than one occurrence of some values, and you were hoping for a LOD equivalent of INDEX, then…

Sorry! That’s eluding me at the moment.

Conceptually you “just” need to subtract something from the RANK_MODIFIED calc to spread the values back out into a unique index. But the value you need to subtract is itself an index, and in the example above (ranking sub-categories) we would need an index across a text field, which this solution doesn’t cater for! In some ways it doesn’t need to be an index, just a unique value for each row shown. And I’d love to hear your ideas on this final step!

Caveats and closing remarks

A couple of caveats with the above approach if you do think that you need to use it.

Firstly you should assess performance for your data. The use of spatial functions and LODs is going to introduce additional work for Tableau to do.

Secondly, really do sense-check whether you need a LOD-based approach in the first place. Often you can still use table calcs in cases where you think you might not be able to.

And finally if you got this far I hope that you found it as fun a challenge as I did. If nothing else it’s enjoyable to test the limits of Tableau. Here’s hoping that in the future we get inbuilt functionality like LEAD and LAG in SQL!

LOD equivalent of INDEX

Is there a level of detail expression equivalent of a particular table calculation in Tableau? This question comes up on the Tableau community forums every now and then, and almost always intrigues me.

Often a level of detail (LOD) expression isn’t really necessary, but occasionally an alternative to table calculations is needed. And like a mountain to be climbed, or a trail to be explored, I’m fascinated by whether it is even possible to implement LOD equivalents of some common table calcs.

I’ve previously posted on why this is a challenge, and how we can implement a LOD equivalent of LOOKUP in certain situations.

But what about INDEX? Or RANK_UNIQUE? The answer is yes, but I’ve only achieved it in quite specific situations:

  • calculating an index or rank over whole numbers only
  • where the numbers are all unique
  • and cover a relatively small range (though you can get around this as I discuss at the end of the post)

I’m posting about this in case it helps others, or triggers further ideas to improve the approach.

Solution walkthrough

For this walkthrough I’m going to use the following test data:

We have a list of fruit, each of which has a value. And we want to rank the fruit by that value. With a table calc this is easy! The third column below uses INDEX() computed by table down for example:

Firstly the big constraint: with a LOD we can jump out of the level of detail we have, to a higher or lower level of detail. However there really is no level of detail that tells us how many records have come before or after a particular record, as that before/after “group” is different for each record and so isn’t actually a different level of detail. My thinking is that we can’t simply use a LOD that counts the values (COUNT or COUNTD) before a particular record.

However, we can use a LOD to SUM up values across the data. And if any given value always had a discernible position within that SUM we might be able to derive an index. At this point my mind turns to bits in binary numbers, or converting numbers to and from base 2.

In binary:

  • 1 is represented by 001
  • 2 is represented by 010
  • 3 is represented by 011
  • 4 is represented by 100
  • 5 is represented by 101, and so on

Perhaps we can switch on “bits” (the 1s above) at the position for each value we have and use that…

Unfortunately Tableau doesn’t offer great binary arithmetic but we can use strings and a brute force calc (this is where the limit on the number range comes in).

The LOD version in purple matches the table calc version in green

And here is a breakdown of the results of each step of the calc…

[Image: LOD INDEX]

Let’s take a look at how it works…

x1 =

[Value]-{FIXED:MIN([Value])}

The x1 calc ensures that the range of values starts at zero by subtracting off the minimum value across the data set (note that if you’re filtering your data you’ll need those filters to be context filters, or include them as a list of dimensions you’re fixing by between FIXED and the colon).

Next we convert x1 into a binary number, switching on the bit (from 0 to 1) at the right position.

x2 = 

POWER(2,[x1])

And we can sum that up over the whole data set using a FIXED LOD (noting again that you’ll need to take into account any filters on your view, adding them to context or to the dimensions fixed by).

x3 = 

{FIXED:SUM([x2])}

This number (144243 in the screenshot above) isn’t that useful as a normal decimal number. We need to convert it back to binary, and we’ll need to be able to work with that binary string of 1s and 0s. Annoyingly we need to brute force this, as you’ll see below!

x4 =

// ref: https://help.salesforce.com/s/articleView?id=001456706&type=1
// should handle numbers up to ~1M
STR(INT(([x3]/2^20))%2)+
STR(INT(([x3]/2^19))%2)+
STR(INT(([x3]/2^18))%2)+
STR(INT(([x3]/2^17))%2)+
STR(INT(([x3]/2^16))%2)+
STR(INT(([x3]/2^15))%2)+
STR(INT(([x3]/2^14))%2)+
STR(INT(([x3]/2^13))%2)+
STR(INT(([x3]/2^12))%2)+
STR(INT(([x3]/2^11))%2)+
STR(INT(([x3]/2^10))%2)+
STR(INT(([x3]/2^9 ))%2)+
STR(INT(([x3]/2^8 ))%2)+
STR(INT(([x3]/2^7 ))%2)+
STR(INT(([x3]/2^6 ))%2)+
STR(INT(([x3]/2^5 ))%2)+
STR(INT(([x3]/2^4 ))%2)+
STR(INT(([x3]/2^3 ))%2)+
STR(INT(([x3]/2^2 ))%2)+
STR(INT(([x3]/2^1 ))%2)+
STR(INT(([x3]/2^0 ))%2)

You’ll need to adjust this to suit your range of numbers and will almost certainly hit limitations with anything bar small ranges of numbers (you can see that I needed 17 lines of the above calc just to cater for a range of 3 to 20!)

The binary string we end up with, 000100011001101110011, has ten 1s in it (corresponding to the 10 values we had), with the position of each 1 representing its value (adjusted so the range starts at 0).

From here we can say “take the digits to the right of my position” for any given number.

x5 = 

RIGHT([x4],[x1]+1)

And that does give us a chance to count the numbers up to and including this number … by counting the 1s. We can do this by removing the 0s and seeing how long the remaining string is…

i LOD = 

LEN(REPLACE([x5],'0',''))

Which gives us our LOD equivalent of INDEX!
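
And here’s the whole trick in a few lines of Python, with illustrative values (bin() stands in for the brute-force x4 string calc):

values = [3, 5, 9, 20]                        # unique whole numbers, small range
lo = min(values)                              # x1 shifts the range to start at 0
total = sum(2 ** (v - lo) for v in values)    # x2, summed across the data set (x3)
bits = bin(total)[2:]                         # x4: the string of 1s and 0s

def lod_index(v):
    # x5 + i LOD: count the 1s at or to the right of this value's bit position
    return bits[-(v - lo + 1):].count('1')

for v in values:
    print(v, lod_index(v))                    # 3→1, 5→2, 9→3, 20→4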

Hang on, what about…

What if I need a larger range of numbers? You may be able to extend the large 2^X string function above to cater for more digits.

What if that isn’t helping, or I need to index strings not numbers? You could pre-calculate an index in the data set (using Tableau Prep, say), and then use the technique above on the pre-calculated index to allow for cases where rows are filtered out in the view. The assumption here is that the pre-calculated index will span a smaller range than the original numbers (and it ensures you have a number in the first place if you’re indexing strings). This was the approach taken in the forum post I mentioned above, and the challenge was adjusting the pre-calculated index as the view was filtered … without using a table calc.

What other alternatives are there to table calcs and LODs? Self-joining your data can allow you to see how many records fall before any given record, and that can be a simpler solution. E.g. join the data to itself where A.value >= B.value, meaning every row in A links to every row with the same or lower value.
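
A quick sketch of that self-join counting idea, in Python for brevity (values are illustrative):

values = [12, 5, 9]
# A join A as B on A.value >= B.value, then count the B rows per A row
index = {a: sum(1 for b in values if b <= a) for a in values}
print(index)   # {12: 3, 5: 1, 9: 2}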

So why do this with LODs? You really shouldn’t have to do this very often, if at all! Editing the compute by for table calculations often solves cases where you don’t think you can use a table calc. In the case that triggered this the results needed to be used in MAKEPOINT which didn’t accept a table calc.

Is that the only reason? No, for me it’s also a fun challenge. And one I don’t feel like I’m fully done with … if you find it intriguing too, can you take the idea further? Or do you have better ways to calculate an INDEX or RANK with a LOD for each row in a data set/view?

Tableau Prep and #PreppinData 2024 week 13

An Easter themed #PreppinData for 2024 week 13. Preparing sales of products in the 12 weeks running up to Easter to allow for easy comparison of the period across years in Tableau Desktop.

A nice one step solution this week (see screenshot at the end of this post): a FIXED level of detail calc to get the first sale date per year; then date calcs to get the week, day and day order.

This week marks a quarter of a year learning Tableau Prep. I started with the Tableau getting started guide (2 hours), committed around an hour a week for 13 weekly #PreppinData challenges (a great resource and progression of learning), answered a handful of Tableau Prep questions on the Tableau community forums (5 hours), and blogged about my learnings to reinforce them (5 hours). 25 hours all up and I feel like I’ve got a good grasp of the product and its capabilities. It’s a great tool for analysts and those needing to do ad-hoc but often repeatable data preparation, cleaning and transformation prior to analysis. Even better, it’s included with your Tableau Creator license if you’re using Desktop or Cloud as a creator! Definitely give it a go if you use Tableau and have a need to tidy data.

[Image: PD 2024 Wk 13]

Tableau Prep and #PreppinData 2024 week 12

#PreppinData 2024 week 12, a graduate student loan repayment calculator. Good to try out the “value ranges from two fields” option within a “new rows” step. Like some others, my interest figure is a little different from the supplied output; however, the calc appears to be the same. I also shortcutted the join onto repayment info for undergraduates with a filter (down to just the undergrad row) and joiner fields allowing a simple join on 1=1.

[Image: PD 2024 Wk 12]

Tableau Prep and #PreppinData 2024 week 11

Week 11 of #PreppinData, and the question: what if there were 13 months in a year? Nice concept to have consistent 28 day months, with 4 weeks per month and each month starting on a Monday and ending on a Sunday. As we found out when we expanded the two row data set though … it’s not as neat as it seems, ending up with a spare day (or two in a leap year).

Part one of my flow answers Jenny’s question “which dates would change months?”. Output gives me 190 as expected.

Part two of the flow looks at Carl’s question of “what the new month(s) would be called”. Turns out the extra month most logically slots in between June and July. This is based on which old month is most associated with a new month number (the number of days of each new month that fall in an old month). And then the “average” of June and July (based on ASCII codes) is Juno. This may (or may not) be Dominican slang for extremely drunk, which coincidentally may (or may not) be what you’d need to be to suggest changing the current calendar in the first place! My preference for the extra day or two at the end is Extrember … as it’s a little extra after December.

[Image: PD 2024 Wk 11]

 

Transform field per line data with Tableau Prep (2)

A couple of weeks ago I wrote about a Tableau Prep approach to transposing data from a text file that had a field per line, with another line separating records. At the time I noted that the approach wasn’t robust enough to handle optional fields, and that it would be annoying to need a join per field in cases where you had a large number of fields. In this follow up post I look at an alternative that doesn’t have those drawbacks.

The basic approach is to use a pivot of rows to columns, along with a record number to group fields from the same record. I also introduce a mapping table to allow for less hardcoding of field name and the record separator.

Here is the example source data I have used:

You’ll notice that Hobby and Food are optional fields that don’t appear on every record.

I also have an Excel sheet that defines the record structure:

Here you will see that I map a Field Label (which appears in the source data) to a Field Name that I want in the output. There is also a field called “(end)” which defines the record separator – in this case a blank line.

The Tableau Prep flow is as follows:

This flow:

  • Loads the source and structure files
  • Prepares them to allow them to be joined
  • Joins them together so that we have field names and know the “end of record” lines
  • Calculates a record number by looking for the “end of record” lines
  • Pivots the data from rows to columns
  • And outputs the result

When loading the source data I generate column names (F1, F2) and switch on the extra source row number field (which will be required for the record number calc later):

The “prep to join” steps replace NULLs with empty strings (for the record separator lines) and, for the source data, rename F1 and F2 to Field Label and Value.

The join step is now fairly straightforward, just joining the source data to the structure info on Field Label:

It results in data like this:

For the pivot step we need something that groups fields for a record together. The “calc record num” step does that, as well as a little tidy up to remove now unnecessary fields and the record separator lines.

The calc for Record Number is using a running sum to work through the source lines in order (ORDERBY [Source Row Number] ASC), generating a number that increments every time we hit an “end of record” line:

{ORDERBY 
 [Source Row Number] ASC:
RUNNING_SUM(
 IF [Field Name]='(end)' THEN 1 ELSE 0 END
)}

A couple of notes re this calc. It actually makes the “end of record” line part of the next record, but as those lines will be filtered out anyway that doesn’t matter. I also have a follow-up calc that adds 1 – it isn’t really necessary, but Record Number “0” looks a bit strange! This is the part that avoids the problem in my previous approach, where optional fields weren’t catered for: previously I was using whole number division to generate a record number from the source row number, and that relied on a set number of fields per record. The running sum option just looks for the record separators.
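
For a feel of what that running sum produces, here’s the same logic over some made-up source lines in Python (incrementing after the separator rather than on it, since those lines get filtered out anyway):

lines = [("Name", "Ann"), ("Age", "34"), ("(end)", ""),
         ("Name", "Bob"), ("Hobby", "Chess"), ("(end)", "")]

records, record_num = {}, 1
for field, value in lines:
    if field == "(end)":                 # record separator found
        record_num += 1
        continue
    records.setdefault(record_num, {})[field] = value

print(records)
# {1: {'Name': 'Ann', 'Age': '34'}, 2: {'Name': 'Bob', 'Hobby': 'Chess'}}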

Next up, a basic pivot translates the rows to columns:

The pivot requires an aggregation for the values, but as we only have one value per row we can pick either MIN or MAX here. This pivot avoids the join per field that I had in the previous solution.

The output step then outputs the results, in this case:

NB: bold/line added after output.

The new approach successfully handles optional fields, doesn’t require a join per field, and has the added benefit of a mapping file to define the structure. One downside is that the field/column order is reversed, but as the resulting CSV is for analysis in Tableau (say) that doesn’t really matter. In my first run I didn’t include the “Food” field, and adding it for a second run was straightforward – so the solution is reasonably plug and play! You can use the same flow with your own source data and structure definition; you just have to right-click the pivot step and “refresh” for it to pick up new field names.

A packaged flow for use in Tableau Prep is available on the Tableau Forums.

One-step Tableau Prep solutions

This quarter I set myself the goal to learn more about Tableau Prep, and a key part of that has been participating in the weekly #PreppinData challenges. Something I’ve noticed, and have been super intrigued about, is that some participants have been posting one-step solutions. That wasn’t surprising during beginner month, but now I’m seeing one-step flows covering reasonably complex multi-step data transformations. Cool!

This week I took a deeper dive into one of those one-step solutions to learn, and share, how they’re being done.

Two quick things first: They say a magician never reveals their secrets, so my apologies in advance to those Tableau Prep magicians who’d rather not see the “magic” shared. Also a hat tip to Hiroaki Morita, whose week 7 solution I picked as the example to dive into. If I don’t do the techniques justice that’s on me, not Hiroaki!

Right, what do we mean by a “single step” solution? Basically, a single step between the input and output, like so:

For comparison here is my solution to week 7, which has two inputs and then four steps before the output:

The first thing you may notice is that there is only one source: a UNION to pull the two required data sets together. That might seem strange to you as the two tables of data are quite different. I’m going to add a quick “clean” step off the union to take a look at what it’s producing:

What we’re seeing here, boxed in red, is that the two tables are indeed unioned together, and so we get two chunks of data. The first chunk has the columns and rows from the first table of data (the couples and when their relationship started), and the second chunk has the columns and rows from the second table (the gift relevant to each year of a relationship – this challenge was about finding the right gift for each couple based on the length of their relationship).

This union approach seems to be a hallmark of single step solutions; get all of the data into one place/table so that we can operate over it in a single step. It does leave us with a challenge though as it’s not structured or related in the sorts of ways we’re used to, so let’s see how that is dealt with in the single clean step.

There are four sub-steps (or changes) in that clean step:

The first change is to calculate a consistent field across the two sets of data – in this case a “number of valentines days as a couple”. Where Year is NULL (no data) we’ll use the relationship start date [A]. Where Year is not NULL we’ll remove the st, nd, rd, th letters from that Year field to just leave a number [B]. In a multi-step solution this would be the consistent field to join the two data sets together.

Because we have all the data in one data set we don’t need to join, instead we need to “look up” the gift from chunk B and plug it into chunk A (where it is missing). And we want to do that based on the consistent field calculated above. So in the screenshot above we can see that we have “number of valentines days as a couple” = 4 on the top row (for “The Loves”) but no gift (it is null). But we also have a row in chunk B with “number of days…” = 4 where we do have the gift (“Fruit/Flowers”).

The next change handles the look up. It uses a FIXED level of detail expression to say “get me the maximum gift from across the whole data set, for this number of valentines days”. Aggregations like MAX will ignore NULLs so we in essence look up the gift from chunk B:

This is potentially another hallmark of one-step solutions then: lookup the value you need from further down the combined data set, rather than using joins.
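
Here’s a rough Python paraphrase of that lookup, with field names guessed from the prose rather than taken from the actual flow:

rows = [
    {"couple": "The Loves", "n_valentines": 4, "gift": None},             # chunk A
    {"couple": None,        "n_valentines": 4, "gift": "Fruit/Flowers"},  # chunk B
]

# in the spirit of {FIXED [n_valentines]: MAX([Gift])}: per key, keep a non-NULL gift
lookup = {}
for r in rows:
    if r["gift"] is not None:
        lookup[r["n_valentines"]] = r["gift"]

for r in rows:
    r["gift"] = r["gift"] or lookup.get(r["n_valentines"])  # fill chunk A from chunk B

print(rows[0])   # {'couple': 'The Loves', 'n_valentines': 4, 'gift': 'Fruit/Flowers'}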

The fourth change is to filter down to just chunk A; chunk B was only there for the look up after all:

After that the solution simply removes the unnecessary columns to be ready to produce the output. Clever, eh!

For me the key points of this one-step solution were:

  1. Get all of the data into one place/table so that we can operate over it in a single step.
  2. Lookup the value you need from further down the combined data set, rather than using joins.
  3. Filter out the data we only pulled in for the lookup.

I hope that you found this as intriguing as I did. And if you’re interested to see more one-step magic, keep an eye on the #prep1stepclub X/Twitter hashtag!

UPDATE: Following a really good discussion with another member of the data community I thought I’d add a few notes about why and when you might use a one-step solution. Our conclusion was that one of the strengths of Tableau Prep is its clear, easy-to-understand and maintainable visual layout. Maintainability is a really important consideration, so you may never* use a one-step solution in production, favouring clarity and maintainability instead. However, for your own professional development, one-step solutions present a useful challenge. They introduce a constraint that forces you to think about problems differently, and in all likelihood use product features that you wouldn’t normally use. That gives you good practice. And after all, why do we climb hills and mountains that we could otherwise go around?

* Although I say “never” I should point out that I haven’t performance tested common one-step solution patterns against their more natural counterparts. Consequently there may be some benefits (or indeed further drawbacks) that I’m not yet aware of. 

Transform field per line data with Tableau Prep (1)

I recently answered a question on the Tableau Community Forums about transposing data from a text file that had a field per line, with a line of dashes separating records. I’m not sure what the formal name for this format is, but there are similarities with RecFiles.

Here is an example:

I don’t know of a way to use data formatted like that directly in Tableau Desktop. But we can use Tableau Prep to transform it into a more natural row-per-record format!

In this post I’ll cover how I suggested doing that for the forum question. And I plan to follow up with some more generic and robust options in a future post.

First let’s take a look at the overall flow given the data above:

What we have here is:

  • An input step (on the left) to load in the file
  • A “clean” step to add a record number to each line
  • Three steps to separate the lines for each field
  • Join steps (Name+Age and Name+Age+Eyes) to join the data set for each field back together to give a traditional record structure
  • An output step to write out as CSV

Diving into each of these components:

The input step defines a split on TAB, headers (F1, F2, etc.), and enables the built in source row number that Prep can add. This row number will be important for identifying a record number next.

The next step adds a record number and removes the dashes which act as a record divider:

Record number is calculated using:

DIV([Source Row Number]-1,4)+1

This is basically just a whole number division (using DIV) of the row number by the number of rows per record, including the divider (4 in this case). Then we filter out the rows with the dashes to get rid of the record dividers. Note that I’ve also neatened up the field names in column F1 above to remove the colon.
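
As a quick illustration of that arithmetic (Python here, purely to show the numbers):

# DIV([Source Row Number] - 1, 4) + 1 for source rows 1..8
for row in range(1, 9):
    print(row, (row - 1) // 4 + 1)   # rows 1-4 → record 1, rows 5-8 → record 2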

Caveat: Because the record number is based on an expected number of fields, this approach won’t be robust enough to handle optional fields that do not appear on some records. This is one reason I’d like to come back and do another post on the topic!

Next we have a clean-up step per field to grab just that field and its record number, including renaming the column header (F2) to the field name. Here is the step for “Name”:

This is repeated per field (annoyingly if you have a large number of fields!) but starts to get us closer to what looks like a row per record.

At this point though each step gives us a record with just one of the fields, and its record number. So next we need to join these up, two at a time, to bring the fields per record back together:

One more of these joins gives us a final output like this…

… meaning that we’ve successfully transformed a data set where each field is on its own line into a more traditional row-per-record / CSV format, which is much better suited to analysis in a tool like Tableau Desktop.

Watch this space for part 2 where I dig into alternative and more robust approaches – e.g. to handle optional fields!

UPDATE: part 2 is now available.