Translation memory is the defining feature of CAT tools: every CAT tool can use it, and it is a major source of efficiency gains. Some tools provide only the translation memory function (a translation memory manager or translation memory tool). Continuing from the last post, I’ll talk about translation memory and the issues surrounding it.
Translation memory (henceforth TM) is software that allows translators to re-use existing translations. It’s often bundled with other tools, but at its core it’s a database that ties source sentences (unit or segment is the better term; I’ll explain further below) to their corresponding target sentences.
The difference between TM and glossary
A glossary is used when you need to translate terminology specific to particular companies, industries, or fields, in order to prevent mistranslation. It’s also referred to as a terminology database, term base, or lexicon, but these basically mean the same thing. The structure is very similar to that of a TM: source words and target words are linked together. It also contains source words that should be left as-is (NTBT, not to be translated) as well as definitions. The glossary and TM are very similar, but the difference lies in the content of the entries. Glossaries deal with words or phrases and TMs deal with sentences, so they have different uses. TM helps increase the speed of translation, while a glossary helps increase accuracy when translating certain words. The glossary acts as a guide, prevents ambiguity in translation, and is quite useful when dealing with very technical documents. But it cannot help you increase speed by reusing previously saved translations.
Utility of TM
I’ve mentioned this in the previous post about CAT tools, but TM is very helpful during the translation process. It may differ a little from CAT tool to CAT tool, but when a segment appears that’s exactly the same as or very similar to a previously translated source segment, TM will allow you to reuse that translation immediately. If you’re working on a document with a lot of repetitions or updating a document you’ve already translated before, this tool is very powerful. TM is also very helpful when you need to maintain consistency in updated documents. To find out more about the advantages of using TM, please refer to the post that analyzes the advantages of CAT tools.
Related terms and concepts
1) Segmentation and TM production
The most important concept when it comes to TM is segmentation. Sentences are the basic unit of TM. If you open a document using a CAT tool (or TM tool), it will break the document down into individual sentence units. Each unit is called a segment. CAT tools do this because matching by sentences is more convenient for later reuse than matching by words.
A sentence is the most common unit and that’s why TM remembers sentences. However, segments aren’t always sentences. For example, titles and subtitles are not always sentences, but they are important and meaningful units, so they both become segments. Text boxes with content also become separate segments. When items are listed using bullet points, each bullet point is a segment. If the source file is an Excel file, the CAT tool user can choose whether each cell becomes one segment (so several sentences in one cell stay together) or each sentence within a cell becomes its own segment.
The process by which a CAT tool breaks a document down into segments is called source document analysis. It’s quite remarkable that the software can do this. How can a machine identify meaningful units so well? The CAT tool uses a simple trick: it breaks the document into segments at periods and line breaks (wherever the Enter key was pressed). Decimal points are not treated as periods. That’s why the CAT tool is able to recognize both a title and a sentence as segments.
However, because the method is so simple, CAT tools often make mistakes. For example, the tool may break ‘U.S. Congress’ into two segments: ‘U.S.’ and ‘Congress’. Not so remarkable. It may also end a segment at ‘etc.’ even if a lowercase letter comes after it. Also, when you create a source document by running a PDF through an OCR program, each line ends with a manual line break, so the CAT tool turns each line into a separate segment. When this happens, the translator can merge segments manually to correct it. Conversely, if a period was mistakenly omitted in the source document, the translator can manually split the segment into the appropriate number of sentences.
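To make the trick and its weak spots concrete, here is a minimal Python sketch of a period-and-line-break segmenter. It is only an illustration: the function names `naive_segments` and `merge_with_next` are made up for this post, and real CAT tools use their own, more elaborate segmentation rules.

```python
import re

# Split after a period that is followed by whitespace, or at a line break.
# A decimal point such as "3.14" has no whitespace after it, so it is left alone.
_BOUNDARY = re.compile(r"(?<=\.)\s+|\n+")


def naive_segments(text):
    """Split text into segments using only periods and line breaks."""
    return [part.strip() for part in _BOUNDARY.split(text) if part.strip()]


def merge_with_next(segments, index):
    """Manually merge the segment at `index` with the one that follows it."""
    merged = segments[index] + " " + segments[index + 1]
    return segments[:index] + [merged] + segments[index + 2:]


print(naive_segments("Title\nFirst sentence. Pi is 3.14 here."))
# ['Title', 'First sentence.', 'Pi is 3.14 here.']

wrong = naive_segments("The U.S. Congress met.")
print(wrong)                      # ['The U.S.', 'Congress met.']
print(merge_with_next(wrong, 0))  # ['The U.S. Congress met.']
```

The title and the decimal point come out right, while the abbreviation is split in the wrong place and has to be merged by hand, just as a translator would do in the editor.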
After segmentation is complete, a TM will be produced as the translator translates. When one source segment is translated, that source segment and its corresponding target segment become one unit of the TM.
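As a rough picture of what a TM unit is, the sketch below stores each confirmed source segment together with its target segment. The class `TranslationMemory` and its methods are hypothetical names for illustration; actual tools keep more data per unit (dates, authors, context) and use their own storage formats.

```python
class TranslationMemory:
    """A toy TM: each unit pairs one source segment with its target segment."""

    def __init__(self):
        self.units = {}  # source segment -> target segment

    def add(self, source, target):
        """Store a unit when the translator confirms a segment."""
        self.units[source] = target

    def lookup(self, source):
        """Return the saved translation for an identical source segment, if any."""
        return self.units.get(source)


tm = TranslationMemory()
tm.add("Save the file.", "Enregistrez le fichier.")

print(tm.lookup("Save the file."))   # Enregistrez le fichier.
print(tm.lookup("Close the file."))  # None
```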
How is a generated TM used later within a CAT tool? There are two methods.
- Set the default so that the TM is not used. The translator can decide for each individual segment whether he or she wants to use the TM.
- Use the TM automatically when you open the source document and use it to translate.
With the second method, the translator needs to decide the minimum match percentage at which TM translations are inserted into target segments automatically (this is called the threshold). After the threshold is set, the CAT tool will use the TM to fill in target segments automatically when the source document is opened. This is called pre-population. Pre-populated segments can be accepted as-is (by pressing the Enter key or clicking on another segment), or the translator can edit them as needed.
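Pre-population with a threshold can be sketched in a few lines of Python. This is an assumption-laden illustration, not how any particular CAT tool works: the function `pre_populate` is a made-up name, the TM is a plain dictionary, and similarity comes from the standard library’s `difflib.get_close_matches`, which is only a stand-in for a tool’s own matching algorithm.

```python
from difflib import get_close_matches


def pre_populate(source_segments, tm, threshold=75):
    """Fill each target segment from the TM when the best match meets the threshold.

    `tm` maps source segments to target segments. `threshold` is a percentage.
    Segments without a good enough match are left empty for the translator.
    """
    targets = []
    for segment in source_segments:
        # Ask for the single closest TM source at or above the threshold.
        best = get_close_matches(segment, list(tm), n=1, cutoff=threshold / 100)
        targets.append(tm[best[0]] if best else "")
    return targets


tm = {"Click the Save button.": "Cliquez sur le bouton Enregistrer."}
document = ["Click the Save button.", "Restart the computer."]
print(pre_populate(document, tm))
# ['Cliquez sur le bouton Enregistrer.', '']
```

The second target stays empty because nothing in the TM comes close to it, so the translator handles that segment from scratch.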
2) Repetitions, 100% matches, fuzzy matches
If a segment corresponds 100% with a saved TM entry, that segment is called a 100% match. In theory, that segment can be accepted as-is. (But sometimes the context is a little different, so it may have to be edited slightly.) If a segment is not a perfect match but is quite similar to a saved TM unit, that segment is a fuzzy match. Fuzzy matches can score anywhere from 0% to 99%, but matches below 70% are not actually very useful. However, the match percentage may increase if segments are merged or split.
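To give a feel for how a match percentage might be computed, here is a small sketch. Each CAT tool has its own scoring method; this one simply borrows `difflib.SequenceMatcher` from Python’s standard library as a substitute, and `match_percent` and `classify` are illustrative names. The 70% cutoff is the rule of thumb mentioned above.

```python
from difflib import SequenceMatcher


def match_percent(segment, tm_source):
    """Character-level similarity between two segments, as a percentage."""
    return SequenceMatcher(None, segment, tm_source).ratio() * 100


def classify(segment, tm_sources, useful_from=70):
    """Label a segment by its best score against the saved TM source segments."""
    best = max((match_percent(segment, s) for s in tm_sources), default=0.0)
    if best == 100:
        return "100% match"
    if best >= useful_from:
        return "fuzzy match"
    return "no useful match"


saved = ["Click the Save button."]
print(classify("Click the Save button.", saved))    # 100% match
print(classify("Click the Cancel button.", saved))  # fuzzy match
print(classify("Restart the computer.", saved))     # no useful match
```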
Repetition, which is different from match, is also important. Repetition does not refer to a similar segment in the TM, but rather to the same segment being repeated within the source document. In this case, even if you don’t have a saved TM, you can translate the repeated segment once and all the other repeats will match 100%.
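Counting repetitions needs no TM at all, only the segmented source document. A minimal sketch (the function name `count_repetitions` is just for illustration):

```python
from collections import Counter


def count_repetitions(segments):
    """Return how many extra times each repeated segment appears in the document."""
    counts = Counter(segments)
    # The first occurrence must be translated; only later ones count as repetitions.
    return {segment: n - 1 for segment, n in counts.items() if n > 1}


document = ["Click OK.", "Enter your name.", "Click OK.", "Click OK."]
print(count_repetitions(document))  # {'Click OK.': 2}
```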
If you analyze a source document using a CAT tool or run TM pre-translation, you’ll be given a report with information on repetitions, 100% matches (or ‘previously translated’), and fuzzy matches. This is important information that helps the translator estimate how long the document will take to translate. The following breakdown is a common example:
- 100% matches
- 95% – 99% matches
- 85% – 94% matches
- 75% – 84% matches
- Unique words (74% or lower)
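A breakdown like this can be produced by adding up word counts per match band. The sketch below assumes each segment has already been scored, and takes a list of (word count, whole-number match percentage) pairs; `BANDS` and `analysis_report` are hypothetical names, and the bands are simply the ones listed above.

```python
# (label, lowest percentage, highest percentage), copied from the list above.
BANDS = [
    ("100% matches", 100, 100),
    ("95% – 99% matches", 95, 99),
    ("85% – 94% matches", 85, 94),
    ("75% – 84% matches", 75, 84),
    ("Unique words (74% or lower)", 0, 74),
]


def analysis_report(scored_segments):
    """Sum word counts per band from (word_count, match_percent) pairs."""
    report = {label: 0 for label, _, _ in BANDS}
    for word_count, percent in scored_segments:
        for label, low, high in BANDS:
            if low <= percent <= high:
                report[label] += word_count
                break
    return report


segments = [(10, 100), (8, 97), (12, 80), (20, 30), (5, 100)]
for label, words in analysis_report(segments).items():
    print(f"{label}: {words} words")
```

With the made-up numbers in `segments`, 15 words fall under 100% matches and 20 under unique words, which is the kind of total a translator would use to estimate the workload.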
Controversy surrounding TM ownership
As you can see, TM is an important asset that does not cost anything to store, and many people want to acquire it. As a result, several parties claim ownership.
1) The translator who generated the TM argues that the TM is theirs.
2) In some cases, agencies argue that the TM is theirs. That’s because agencies request discounts from translators based on repetitions or matches.
3) In other cases, there are those who argue that the TM belongs to the end client who created the source document in the first place. (End clients often don’t even know what TM is, but some agencies try to obtain more clients by offering to provide TM.)
The discussion is quite chaotic right now, and the parties involved have different viewpoints on TM ownership. In my opinion, the TM belongs to the translator. If translators sign confidentiality agreements, they may not be able to use the TM freely in all situations, but for another project from the same client they are free to use it or not. A TM match also doesn’t always mean the translation can be used as-is, so translators sometimes end up putting in more work. Finally, because the translator purchased the CAT tool on his or her own to generate or use the TM, the agency or end client does not have a strong argument for owning it (though this argument is harder to make when using free online CAT tools). Also, a lot of the TM I generated involved extra effort on my part: I converted files that could not be used right away in order to generate a TM. I don’t think it’s right for anyone else to claim they own this TM, and I plan to express my opinion whenever I have the opportunity.
However, I do provide discounts for repetitions and matches. If the base rate is high enough, reasonable adjustments are not much of a problem. Agencies can also provide discounts to end clients in this way. (But the translator won’t know if the discount is made known to the end client.) I hope that someday translators will be paid extra for the TM if the agency requests it. Then translators won’t feel so bad about providing discounts when using a TM that an agency or end client provides; they may actually feel obligated to provide discounts in that case, because the person who generated the TM has already been compensated and the translator can work faster using a TM obtained from someone else. I wonder when that day will come.
Most end clients don’t know what TM is, and in many cases agencies don’t really pay attention to it. (This is actually good for the translator. But I have a feeling that as time passes, agencies will have no choice but to become more attentive to TM.) Thus, I think it will be difficult for TM management to become a common industry practice for now.