TLK (DA2)

From Dragon Age Toolset Wiki
Jump to: navigation, search

This article contains some analysis of the TLK format found in Dragon Age 2. The article assumes prior knowledge of the GFF format.

Raw View

raw view
This image shows an example of how the raw talk file in Dragon Age 2 looks like. As you can see, the readable text at a first glance is minimal, limited to the few recognizable strings as GFF, type, version, etc.

We will assume that there are ways to convert the raw format into something more readable, and in the next paragraphs we will explore some ways to make it happen

Header View

header view
Firstly, we will look at the big header, but because it's a standard format that never changes across different talk files, and the GFF format is covered in depth in a different article, we will not spend a lot of time on it

The GFF file header has:

- 4 bytes GFF Magic Number,

- followed by four bytes as version V4.0,

- followed by four bytes for platform, in this case PC,

- followed by four bytes as file type TLK,

- followed by four bytes as file type version V0.5.

- after that, 4 bytes show the number of structures inside this GFF, in this case only 2.

- the next four bytes is the data offset and shows number 78 which is hexadecimal for 120, and show at what byte starting from the beginning of file the actual talk file starts, basically this means that the first 120 bytes are the file header.

- after that we get four bytes with the name(type) of the first out of two structures HTLK,

- followed by four bytes that show the number of fields inside the structure, in our case 3,

- followed by four bytes that show the field offset, which means at what byte starting from the beginning of file the data related to this structure actually starts, in our case 3C in hexadecimal, which is decimal 60,

- followed by four bytes that show the length of the structure, 0C hexadecimal in our case, which is decimal 12 which is 4 bytes * no. of fields 3 = 12

- next, similar information about the second out of two structures named HSTR

- which has four bytes that showed the number of fields inside the structure, in this case 2,

- followed by four bytes that show at what byte starting from the beginning of the file the data related to the structure actually starts, in our case hexadecimal 60, which is decimal 96

- followed by four bytes that show the length of the structure, 08, which is four bytes * no. of fields 2 = 8

- at this point we reached 3C( hexadecimal), which is 60 decimal byte, which is where we learned from the above information that the first structure HTLK related data starts, here is where the field array starts for this first structure

- the relevant data here is displayed in a predictable manner according to the GFF documentation, basically you get Label, Field Type, Index (3 field subtype) in increments of four bytes, for each of the two structures fields,HTLK has three fields and HSTR has two fields, 2+3 = 5 total structures multiplied by 3( field subtype) = 15

- And we reached byte 78 hexadecimal/120 decimal. Once again everything up to this point is identical across all the talk files in Dragon Age 2

The next sequence of bytes: at bytes 78 hexadecimal, you can see 0C (four bytes). I'm not sure what this is yet, but seems to be the same value across talk files. After that there are a sequence of four more bytes, in our example 90 hexadecimal, which may be the size in bytes of the section of the talk file where the list of strings is saved. It may also be the size of the serialized Huffman tree that would be discussed later on.

String ID View

string ID view
Moving on to the next picture, the first four bytes listed 224 hexadecimal, seems to be the size of the compressed string, but I'm not sure. These three values( in our examples 0C,90,224 hexadecimal) are the only one that I'm not 100% sure of, I hope I can do better with the rest of the explanation.

In this section, there is not much to say, after the first four bytes 224 hexadecimal that I'm not sure of, the rest is clearer

- Four bytes, in our case 10 hexadecimal, 16 decimal is the number of strings contained in this example talk file

- After that there is a sequence of 16 pairs ( the number of pairs has to be identical with the number identified in the line above, in our example 16 decimal)

- Each pair consists of a string ID and a number that represents the bit count where the string actually starts inside the compressed string

- In our example the first string ID in decimal is 6200472 ( little endian 989C5E hexadecimal), followed by the string Start count, in our example AB hexadecimal, which is decimal 171, and basically means that the string ID 6200472 first character starts at position 171 inside the compressed string. The compression is a standard Huffman coding, which uses simple concatenation when compressing, so knowing the position of the first character of any string, can easily help us get the rest of the string decoded from the compressed stream. Hopefully it would be clearer later on in the article.

- There is not much else to say, except that in this example you can see that we have redundant strings( with different ID but with the same starting point, so basically the strings uncompressed are the same, for what ever reason the Dragon Age 2 creators decided to create identical strings with different string ID). Also, some strings in this particular example are empty. it may not be obvious that they are empty by looking at the raw content, but it's apparent once all the strings are decompressed. Again, I don't know why an empty string is actually needed. this is a side note, and not really relevant, I wouldn't mind if someone has more info about these empty strings

Huffman coding

Huffman coding
If you look at the picture, the first important set of four bytes is the one that in our example is hexadecimal 64 ( decimal 100). This number shows the number of nodes used in the Huffman tree specific to this talk file. It is a very good idea before moving forward to glance over the coding article on Wikipedia.

Each node is represented by four bytes. Some nodes are actually characters( and follow standard ASCII encoding), but the developers decided to use them in a negative number format that's why they look like XX-FF-FF-FF format. These nodes are called leaves. It is very easy to identify which character in ASCII encoding they represent( although it may also be Unicode): the formula is FF-FF-FF-FF -( minus) XX-FF-FF-FF. To simplify we can use only FF-FF - XX-FF. The result is the actual character encoding in hexadecimal. In our example the first sequence of four bytes in the negative representation is D3-FF-FF-FF. If we apply the formula mentioned above FF-FF - D3-FF, the result is 2C hexadecimal which is 44 decimal which represents the character "D" ( the capital letter D). So we know that this specific talk file definitely contains a string that has the character "D".

The other format is in positive representation XX-00-00-00, and they are internal nodes, which act as fillers in the Huffman tree, these nodes do not represent characters, but instead connect leaves to parent nodes.

Some more information: in the original Huffman coding the formula is that the Huffman tree contains N leaves + (N internal nodes - 1), so 2N-1 total number of nodes. In Dragon Age 2, there is an extra important leaf needed which is the character that signifies the end of a string. In this case, the negative number format for the specific character was decided to be decimal -1 which is hexadecimal FF-FF-FF-FF. Which is perfect when used with the above formula, because this is how the algorithm knows when the string inside the compressed stream was fully identified and decoded, and that happens when the formula generates 0 as the result of FF-FF -(minus) FF-FF

Without this character, the original Huffman coding formula would be, in our case 50 leaves + ( 50-1)internal nodes = 99 nodes in the tree, which is one less then the expected 100 decimal/64 hexadecimal that was identified as the number of nodes in the tree in the first step of this section.

so in our Dragon Age case we end up with N+1 leaves + N-1 internal nodes = 2N

The pattern to serialize the tree nodes is : the nodes in the tree are evaluated as some sort of Breadth-first search ( check out the article in Wikipedia for more information), basically you insert the nodes in a sequence reading them from left to right and from the lowest leaf to the highest node in the tree ( excluding the root, which is not considered).so in our example letter D represents the lowest leaf from the left of the tree, that's why is the first one that appears in our serialized sequence as D3-FF-FF-FF. And then, one by one in sequence, from left to right and bottom to top enumerate the leaves and/or internal nodes as you encounter them up until the last internal nodes which are the immediate children of the only root( which once again is not considered in the serialization process). As a side note, it is not unexpected that the last negative number which represents an actual leaf/character in our list is DF-FF-FF-FF, which turns out to be hexadecimal 20, which is decimal 32 which stands for the "SPACE" character, which is the most common character in pretty much any talk file.

Later on we will discuss an actual example, to test the assumptions we learned above

5. Compressed Strings

compressed strings
After the serialized Huffman binary tree we can see a sequence of four bytes that show number 66 hexadecimal, which is 102 decimal, which is the number of sequences of four bytes that the compressed stream in its entirety consists of, so this number shows that the compressed stream is 102*4= 408 bytes long, and inside this 408 bytes stream the 16 strings from our example are compressed into

one thing to notice is that because our pattern follows increments of four bytes, the last two bytes in the stream are actually two pairs of zeros, this means that the actual 16 strings need only 406 bytes, but because 406 is not the multiple of four, we had two more empty bytes as padding

Now we can take a look at the simple example to see if what we learned above applies

Let's consider the following starting strings: in this simple example the pattern we would follow is that numbers represent string ID, and capital letters represent the actual string. The character "|"( bar) represents a visual representation of the special character that signals the end of the string as discussed in a previous section

1	A|

2	AB|

3	ABC|

4	ABCD|

5	ABCDE|

6	ABCDEF|

7	ABCDEFG|

8	ABCDEFGH|

9	ABCDEFGHI|

using any Huffman coding program that can be found on the Internet, we get the following Huffman tree

Key:A	 Value:9	 Frequency:0.167	 Binary:110

Key:|	 Value:9	 Frequency:0.167	 Binary:111

Key:B	 Value:8	 Frequency:0.148	 Binary:101

Key:C	 Value:7	 Frequency:0.130	 Binary:100

Key:D	 Value:6	 Frequency:0.111	 Binary:010

Key:E	 Value:5	 Frequency:0.093	 Binary:001

Key:F	 Value:4	 Frequency:0.074	 Binary:000

Key:G	 Value:3	 Frequency:0.056	 Binary:0110

Key:H	 Value:2	 Frequency:0.037	 Binary:01111

Key:I	 Value:1	 Frequency:0.019	 Binary:01110

Let's analyze: we have nine strings, and each string contains "A", so we also have 9 occurrences of "A". Because we use "|" for visual representation where the string terminates, we also have nine occurrences of the special "|" character. And so on for each character we encounter we add it to the table

Now let's create a basic visual representation of the tree, so we can use it as an aid when we count left to right and bottom to top each node, regardless if it's a leaf( actual character) or an internal node( used as filler/connector). Remember, we don't consider the root. Also as a convention, we will consider 0 on the left, and 1 on the right,it's easier to read the tree left to right.IN means internal node, and the number increments bottom to top and left to right

Huffman tree representation

applying our assumptions that we learned previously, we can see that the first item in our serialized tree will be I ( capital i). If you remember, the actual characters are represented as negative numbers."I" ASCII code is 49 hexadecimal. FF-49=B6, so our first sequence of four bytes in the serialized tree looks like:B6-FF-FF-FF. Reading left to right and bottom to up our visual tree above, the next sequence of four bytes is "H", which going through the motions turns into B7-FF-FF-FF, followed by "G" a.k.a. B8-FF-FF-FF, followed by the first internal node we encounter IN0, which is represented as positive 00-00-00-00.

now we reached "F" which is B9-FF-FF-FF, and so on and so on, just follow the left to right and bottom to up pattern. From the tree above we can see for example that the last two sequences of four bytes in our serialized tree are 06-00-00-00 and 07-00-00-00

if we count all the nodes in our tree, we can see that they are 18 decimal/12 hexadecimal, and following the formula 2N, which means N+1 for leaves and N-1 for internal nodes, then we have 10 leaves and 8 internal nodes,N=9

and we are done serializing the Huffman tree, we have all the information we need, such as complete number of nodes(18d/12h), the correct order of nodes, starting with letter "I", and ending with IN7

Now let's move on to handling the compressed stream that contains the nine strings along with the termination character.

if we look at our sequence of strings from our working example, the first string is "1 A|", which has only two characters: "A" and the termination character "|" (character BAR, not to be confused with letter capital "i")

now if we look at our table, we can see that "A" is represented as Binary 110, and "|" is represented as Binary 111

if you recall, Huffman compression is just a simple concatenation, so we can expect that our first string "A|" would be represented as "110"+"111" = 110111

you may also recall from the section 3 of this article: string ID, that "Each pair consists of a string ID and a number that represents the bit count where the string actually starts inside the compressed string". For our first string, the string ID is 1, and for simplicity we will assume that the count starts at zero.

This means that the section of the talk file that deals with string IDs, will represent the relevant information about the first string as "01 00 00 00" sequence of four bytes as the string ID, followed by another sequence of four bytes as the counter in the compressed stream, in our case for the first string, the counter starts from zero, represented as "00 00 00 00"

now let's take a look at the second string, and apply the same thing: "2 AB|"

binary speaking "AB|" string is converted into 110101111, according to our Huffman table

the second string counter starts where the first string ended, the first string started at 0, had a length of 6 bits, and ended at position 5 in the compressed stream, so the second string counter has to start at the next available bit in the stream, which is 6. This way we know that the second string ID is "02 00 00 00", and the counter would be "06 00 00 00"

let's do one more string "3 ABC|"

binary it becomes 110101100111

its ID is "03 00 00 00", and its counter is 15 decimal which is 0F hexadecimal so it is "0F 00 00 00"

And so on, and so on. the binary representation of all the strings converted turns the compressed stream into the following binary stream:

"1 1 0 1 1 1 1 1 0 1 0 1 1 1 1 1 1 0 1 0 1 1 0 0 1 1 1 1 1 0 1 0 1 1 0 0 0 1 0 1 1 1 1 1 0 1 0 1 1 0 0 0 1 0 0 0 1 1 1 1 1 1 0 1 0 1 1 0 0 0 1 0 0 0 1 0 0 0 1 1 1 1 1 0 1 0 1 1 0 0 0 1 0 0 0 1 0 0 0 0 1 1 0 1 1 1 1 1 0 1 0 1 1 0 0 0 1 0 0 0 1 0 0 0 0 1 1 0 0 1 1 1 1 1 1 1 1 1 0 1 0 1 1 0 0 0 1 0 0 0 1 0 0 0 0 1 1 0 0 1 1 1 1 0 1 1 1 0 1 1 1 "

For example we can validate by looking at the last three characters in our stream, which we assume it should be "HI|", and indeed the last 13 bits in our compressed stream are the 5 bits representing "H", followed by the 5 bits representing "I", followed by the 3 bits representing the termination character "|"

01111+01110+111 = "HI|"

Now there is one more step, which is to turn the binary stream of zeros and ones into its hexadecimal equivalent. We know that any byte is 8 bits, and any 8 bits combination of 0 and 1, has its equivalent of any combination of 2 hexadecimal numbers from 0 to F. For example if we take the first 8 bits from our compressed stream 11011111, which in compressed form stand for "A","B" and 2/3 bits of "|", that sequence of eight bits actually has the hexadecimal equivalent of "DF". In our example, the long binary stream turns into a sequence of 22 bytes in hexadecimal representation, something like: "DF BB 37 7B A3 BD 11 F7 46 C4 DE 88 B0 BD 11 61 7E 6F 44 98 77 03". But if you remember, we need to apply some padding so that the sequence of bytes in hexadecimal form that represents the compressed stream is multiple of 4, so in our example the final hexadecimal string looks like "DF BB 37 7B A3 BD 11 F7 46 C4 DE 88 B0 BD 11 61 7E 6F 44 98 77 03 00 00"

The raw view of a working example end result can be seen below:

raw view of working example