Tuesday, April 10, 2018

The Data Security Issues Around Public MT - A Translator Perspective

This is a guest post by Mats Linder on the data privacy and security issues around the use of public MT services in professional translator use scenarios.

As I put this post together, I can hear Mark Zuckerberg giving his testimony on Capitol Hill, answering shockingly ignorant questions from legislators who don't really have a clue. This is not so different from the naive and somewhat ignorant comments on the data privacy issues around MT that I see in translation industry blogs. The looming GDPR deadlines have raised the volume of discussion on the privacy issue, but unfortunately not the clarity. GDPR will now result in some companies being fined, and since it is now possible to calculate what it costs not to get data protection right, many companies are being much more careful, at least in Europe. But as the Guardian said: "If it’s rigorously enforced (which could be a big “if” unless data protection authorities are properly resourced) it could blow a massive hole in the covert ad-tracking racket – and oblige us to find less abusive and dysfunctional business models to support our online addiction."

As the Guardian wrote recently:

"This is what security guru Bruce Schneier meant when he observed that “surveillance is the business model of the internet”.  The fundamental truth highlighted by Schneier’s aphorism is that the vast majority of internet users have entered into a Faustian bargain in which they exchange control of their personal data in return for “free” services (such as social networking, [MT], and search) and/or easy access to the websites of online publications, YouTube and the like.

Big though Facebook is, however, it’s only the tip of the web iceberg. And it’s there that change will have to come if the data vampires are to be vanquished."

In our current online world, only the paranoid thrive.

Richard Stallman, President of the Free Software Foundation, had this to say:

"To restore privacy, we must stop surveillance before it even asks for consent.

Finally, don’t forget the software on your own computer. If it is the non-free software of Apple, Google or Microsoft, it spies on you regularly. That’s because it is controlled by a company that won’t hesitate to spy on you. Companies tend to lose their scruples when that is profitable. By contrast, free (libre) software is controlled by its users. That user community keeps the software honest."

Apparently, there is a special term for this kind of data acquisition and monitoring effort: Shoshana Zuboff calls it "surveillance capitalism".

Here is Valeria Maltoni on this issue:

"Breaches expose information the other way. They shine a light on the depth and breadth of data gathering practices — and on the business models that rely on them. Awareness changes the perception of knowledge and its use. Anyone not living under a rock now is aware that we likely don't know all the technical implications, but we know enough to start making different decisions on how we browse and communicate online.

Business models are the most problematic, because they create dependency on data and an incentive to collect as much as possible. Beyond advertising, lack of transparency on third party sharing and usage merit further scrutiny. Perhaps the time has come to evolve business practices — how platforms and people interact — and standards — based on laws and regulations"

So when I read that Google says, in a FAQ no less, that they really, with all their little heart, promise not to use your data, or when Microsoft tells me they have a "No-Trace Policy" for your MT data, I am more than a little skeptical. Especially when, just last week, I got an email from Microsoft about an update to the Terms of the Service Agreement which contains some big updates in clause 2 related to what they can do with "Your Content".

While some may feel that it is possible to trust these companies I remain unconvinced and suggest that you consider the following:
  • What are the terms of the Service Agreement governing your use of the MT service (not the FAQ or some random policy page)? The only legally enforceable contract an MT user has is what is stated in the TOS, and I would not be surprised if there are several loopholes in there as well.
  • Once a large web services company sets a data-harvesting, ad-supported infrastructure in motion, it is not easily turned off, and while it is possible there may be more privacy in the EU, I have already seen that Google has made it very clear that they are using my data every time I use the Google Translate service. So my advice to you is caveat emptor if it really matters that your data privacy stays intact. But if you send your translation content back and forth via email, it does not make any difference anyway. Does it?

Mats has made a valiant attempt to wade through the vague and ambiguous legalese that surrounds the use of these mostly ad-supported MT services in his post below.


How (un)safe is machine translation?
Some time ago there were a couple of posts on this site discussing data security risks with machine translation (MT), notably by Kirti Vashee and by Christine Bruckner. Since they covered a lot of ground and might have created some confusion as to what security options are offered, I believe it may be useful to take a closer look from a narrower perspective, mainly the professional translator’s point of view. And although the starting point is the plugin applications for SDL Trados Studio, I know that most of these plugins are also available for other CAT tools.

About half a year ago, there was an uproar over Statoil’s discovery that some confidential material had become publicly available because it had been translated with the help of a free translation website (not to be confused with the site of the popular MT provider MyMemory). The story was reported in several places; this report gives good coverage.

Does this mean that all, or at least some, machine translation runs the risk of compromising the material being translated? Not necessarily – what happened to Statoil was the result of trying to get something for nothing; i.e. a free translation. The same thing happens when you use the free services of Google Translate and Microsoft’s Bing. Frequently quoted terms of use for those services state, for instance, that “you give Google a worldwide license to use, host, store, reproduce - - - such content”, and (for Bing): “When you share Your Content with other people, you understand that they may be able to, on a worldwide basis, use, save, record, reproduce - - - Your Content without compensating you”. This should indeed be off-putting to professional translators but should not be cited to scare them from using services for which those terms are not applicable.

The principle is this: if you use a free service, you can be almost certain that your text will be used to “improve the translation services provided”; i.e. parts of it may be shown to other users of the same service if they happen to feed the service with similar source segments. However, the terms of use of Google’s and Microsoft’s paid services – Google Cloud Translate API and Microsoft Text Translator API – are totally different from those of the free services. Not only can you choose not to send back your finalized translations (i.e. update the provider’s data with your own translations); it is in fact not possible to do so, at least not if you use Trados Studio.

Google and Microsoft are the big providers of MT services, but there are a number of others as well (MyMemory, DeepL, Lilt, Kantan, Systran, SDL Language Cloud…). In essence, the same principle applies to most of them. So let us have a closer look at how the paid services differ from the free.

Google’s and Microsoft’s paid services

Google states, as a reply to the question Will Google share the text I translate with others: “We will not make the content of the text that you translate available to the public, or share it with anyone else, except as necessary to provide the Translation API service. For example, sometimes we may need to use a third-party vendor to help us provide some aspect of our services, such as storage or transmission of data. We won’t share the text that you translate with any other parties, or make it public, for any other purpose.”

And here is the reply to the question after that, Will the text I send for translation, the translation itself, or other information about translation requests be stored on Google servers? If so, how long and where is the information kept?: “When you send Google text for translation, we must store that text for a short period of time in order to perform the translation and return the results to you. The stored text is typically deleted in a few hours, although occasionally we will retain it for longer while we perform debugging and other testing. Google also temporarily logs some metadata about translation requests (such as the time the request was received and the size of the request) to improve our service and combat abuse. For security and reliability, we distribute data storage across many machines in different locations.”

For the Microsoft Text Translator API the information is more straightforward, on their “API and Hub: Confidentiality” page: “Microsoft does not share the data you submit for translation with anybody.” And on the "No-Trace" page: “Customer data submitted for translation through the Microsoft Translator Text API and the text translation features in Microsoft Office products are not written to persistent storage. There will be no record of the submitted text, or portion thereof, in any Microsoft data center. The text will not be used for training purposes either. – Note: Known previously as the “no trace option”, all traffic using the Microsoft Translator Text API (free or paid tiers) through any Azure subscription is now “no trace” by design. The previous requirement to have a minimum of 250 million characters per month to enable No-Trace is no longer applicable. In addition, the ability for Microsoft technical support to investigate any Translator Text API issues under your subscription is eliminated.”

Other major players

As for DeepL, there is the same difference between free and paid services. For the former, it is stated – on their "Privacy Policy DeepL" page, under Texts and translations – DeepL Translator (free) – that “If you use our translation service, you transfer all texts you would like to transfer to our servers. This is required for us to perform the translation and to provide you with our service. We store your texts and the translation for a limited period of time in order to train and improve our translation algorithm. If you make corrections to our suggested translations, these corrections will also be transferred to our server in order to check the correction for accuracy and, if necessary, to update the translated text in accordance with your changes. We also store your corrections for a limited period of time in order to train and improve our translation algorithm.”

To the paid service, the following applies (stated on the same page but under Texts and translations – DeepL Pro): “When using DeepL Pro, the texts you submit and their translations are never stored, and are used only insofar as it is necessary to create the translation. When using DeepL Pro, we don't use your texts to improve the quality of our services.” And interestingly enough, DeepL seems to consider their services to fulfill the requirements stipulated – currently as well as in the coming legislation – by the EU Commission (see below).

Lilt is a bit different in that it is free of charge, yet applies strict Data Security principles: “Your work is under your control. Translation suggestions are generated by Lilt using a combination of our parallel text and your personal translation resources. When you upload a translation memory or translate a document, those translations are only associated with your account. Translation memories can be shared across your projects, but they are not shared with other users or third parties.”

MyMemory – a very popular service which is in fact also free of charge, even though it uses the paid services of Google, Microsoft, and DeepL (you cannot select the order in which those are used, nor can you opt out of using them altogether) – also uses its own translation archives and offers the use of the translator’s private TMs. Your own TM material cannot be accessed by any other user, and as for MyMemory’s own archive, this is what they say under Service Terms and Conditions of Use:

“We will not share, sell or transfer ’Personal Data’ to third parties without users' express consent. We will not use ’Private Contributions’ to provide translation memory matches to other MyMemory's users and we will not publish these contributions on MyMemory’s public archives. The contributions to the archive, whether they are ’Public Data’ or ’Private Data’, are collected, processed and used by Translated to create statistics, set up new services and improve existing ones.” One question here is of course what is implied by “improve” existing services. But MyMemory tells me that it means training their machine translation models, and that source segments are never used for this.

And this is what the SDL Language Cloud privacy policy says: “SDL will take reasonable efforts to safeguard your information from unauthorized access. – Source material will not be disclosed to third parties. Your term dictionaries are for your personal use only and are not shared with other users using SDL Language Cloud. – SDL may provide access to your information if SDL plc believes in good faith that disclosure is reasonably necessary to (1) comply with any applicable law, regulation or legal process, (2) detect or prevent fraud, and (3) address security or technical issues.”

Is this the whole truth?

Most of these terms of service are unambiguous, even Microsoft’s. But Google’s leaves room for interpretation – sometimes they “may need to use a third-party vendor to help us provide some aspect of [their] services”, and occasionally they “will retain [the text] for longer while [they] perform debugging and other testing”. The statement from MyMemory about “improving” existing services also raises questions, but as noted, I am told that this means training their machine translation models and that source segments are never used for it. However, since MyMemory also utilizes the Google Cloud Translate API (and you don’t know when), you need to take the same care with MyMemory as with Google.

There is also the problem that companies such as Google and Microsoft cannot be made to reply to questions when you want clarifications. And it is very difficult to verify the security provided, so the “trust but verify” principle is all but impossible to implement (and not only with Google and Microsoft).

Note, however, that there are plugins for at least the major CAT tools that offer possibilities to anonymize (mask) data in the source text that you send to the Google and Microsoft paid services, which provides further security. This is also to some extent built into the MyMemory service.
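To illustrate what such anonymization plugins do, here is a minimal, hypothetical sketch of placeholder masking: sensitive spans are swapped for opaque tokens before the segment goes to the MT service, and swapped back in the returned translation. The patterns and token format are invented for the example; real plugins are considerably more sophisticated:

```python
import re

# Patterns for data you may want to mask before sending a segment to a
# public MT API; extend as needed (names, account numbers, etc.).
PATTERNS = {
    "EMAIL": r"[\w.+-]+@[\w-]+\.[\w.-]+",
    "NUMBER": r"\b\d[\d ,.]*\d\b|\b\d\b",
}

def mask(text):
    """Replace sensitive spans with numbered placeholders; return masked text and the mapping."""
    mapping = {}
    counter = 0
    for label, pattern in PATTERNS.items():
        def repl(m):
            nonlocal counter
            token = f"__{label}{counter}__"
            mapping[token] = m.group(0)
            counter += 1
            return token
        text = re.sub(pattern, repl, text)
    return text, mapping

def unmask(text, mapping):
    """Restore the original spans in the (translated) text."""
    for token, original in mapping.items():
        text = text.replace(token, original)
    return text

masked, mapping = mask("Contact jane.doe@example.com about invoice 4711.")
print(masked)   # → Contact __EMAIL0__ about invoice __NUMBER1__.
print(unmask(masked, mapping))  # restores the original sentence
```

Because most MT engines pass unknown tokens like `__EMAIL0__` through untranslated, the placeholders usually survive the round trip intact, though this should be verified per engine.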
But even if you never send back your translated target segments, what about the source data that you feed into the paid services? Are they deleted, or are they stored so that another user might hit upon them even if they are not connected to translated (target) text?

Yes and no. They are generally stored, but – also generally – in server logs, inaccessible to users and only kept for analysis purposes, mainly statistical. Cf. the statement from MyMemory.

My conclusion, therefore, is that as long as you do not return your own translations to the MT provider, and you use a paid service (or Lilt), and you anonymize any sensitive data, you should be safe. Of course, your client may forbid you to use such services anyway. If so, you can still use MT but offline; see below.

What about the European Union?

Then there is the particular case of translating for the European Union, and furthermore, the provisions in the General Data Protection Regulation (GDPR), to enter into force on 25 May 2018. As for EU translations, the European Commission uses the following clause in their Tender specifications:

”Contractors intending to use web-based tools or any other web-based service (e.g. cloud computing) to execute the [framework contract] must ensure full compliance with the terms of this call for tenders when using such services. In particular, the provisions on confidentiality must be respected throughout any web-based process and the Union's intellectual and industrial property rights must be safeguarded at all times.” The Commission considers the scope of this clause to be very broad, covering also the use of web-based translation tools.

A consequence of this is that translators are instructed not to use “open translation services” (a term that begs definition, does it not?) because of the risk of losing control over the contents. Instead, the Commission has its own MT system, eTranslation. On the other hand, it seems possible that DG Translation is not quite up to date concerning the current terms of service – quoted above – of the Google Cloud Translate API and Microsoft Text Translation API, and if so, there is a slight possibility that they might change their policy with regard to those services. But for now, the rule is that before a contractor uses web-based tools for an EU translation assignment, authorisation to do so must be obtained (and so far, no such requests have been made).

As for the GDPR, it concerns mainly the protection of personal data, which may be a lesser problem generally for translators. In the words of Kamocki & Stauch on p. 72 of Machine Translation, “The user should generally avoid online MT services where he wishes to have information translated that concerns a third party (or is not sure whether it does or not)”.

Offline services and beyond

There are a number of MT programs intended for use offline (as plugins in CAT tools), which of course provides the best possible security (apart from the fact that transfer back and forth via email always constitutes a theoretical risk, which some clients try to eliminate by using specialized transfer sites). The drawback – apart from being limited to your own TMs – is that they tend to be pretty expensive to purchase.

The ones that I have found (based on investigations of plugins for SDL Trados Studio) are, primarily, Slate Desktop translation provider, Transistent API Connector, and Tayou Machine Translation Plugin. I should add that so far in this article I have only looked at MT providers based on statistical machine translation or its further development, neural machine translation. But one offline contender which, for some language combinations (involving English), also offers pretty good results is the rule-based PROMT Master 18.

However, in conclusion I would say that if we take the privacy statements from the MT providers at face value – and I do believe we can, even when we cannot verify them – then for most purposes the paid translation services mentioned above should be safe to use, particularly if you take care not to pass back your own translations. But still, I think both translators and their clients would do well to study the risks described and advice given by Don DePalma in this article. Its topic is free MT, but any translation service provider who wants to be honest with clients, while taking advantage of even paid MT, would do well to study it.

Mats Dannewitz Linder has been a freelance translator, writer and editor for the last 40 years alongside other occupations, IT standardization among others. He has degrees in computer science and languages and is currently studying economics and political science. He is the author of the acclaimed Trados Studio Manual and for the last few years has been studying machine translation from the translator’s point of view, an endeavour which has resulted in several articles for the Swedish Association of Translators as well as an overview of Trados Studio apps/plugins for machine translation. He is self-employed at Nattskift Konsult.

Friday, April 6, 2018

UTH - Another Chinese Translation Memory Data Utility

This is a guest post by Henry Wang of UTH. I include a brief interview I conducted before Henry wrote this post. I think this focus on developing a data marketplace is interesting, as I happen to believe that the data used to train machine learning systems is often more important than the algorithms themselves. The number of open source toolkits available for building Neural MT systems is now almost 10.

I do not have a sense of whether the quality of the UTH data is better than that of other existing data utilities, and this post is not an endorsement of UTH by me. They do, however, appear to be investing much more effort in cleaning the data, but I still feel that the metadata is sorely lacking for real value to come from this data. And metadata is not just about domain classification. It will be interesting to see the quality of the MT systems that are built using this data; that evidence will be the best indicator of the quality and value of this data to the MT community.

These data initiatives in China also reflect the building AI momentum in China. If you have the right data you can learn to develop high-value narrow purpose focused machine learning solutions. 

  1. What are the primary sources of your data?
Henry: The primary sources of our data include LSPs (language service providers), freelance translators, language service buyers, and several big data organizations.
  2. Can you describe the metadata that you allow users to access to extract the most meaningful subsets for their purposes? Can you provide an overview of your detailed data taxonomy?
Henry: We created a three-tier pyramid structure of the data with 15 top-tier domains, 41 intermediate domains, and 178 bottom-level domains. Users can extract subsets by choosing domain names (among the three tiers), language combinations, and other criteria that we provide, and will continue to provide, in our product UIs.
  3. Who are your primary customers?
Henry: MT companies/labs, LSPs, AI companies, e-commerce companies, and universities.
  4. Do you price differently for LSPs, who might use less data, than for MT developers, who need much more data?
Henry: Yes.
  5. Do you plan to provide an English interface so that users across the world can also access your data?
Henry: Yes, we have launched several products with English UIs, including Sesame Search.
  6. Do you have your own MT solution? How does it compare with Google for some key languages?
Henry: We are working on that. We also partner with Sogou and several MT labs in China for different language combinations. We believe we will do better than Google in China-related language pairs, and this will come true within 2 years.
  7. Do you see an increasing interest in the use of this kind of language data? What other applications are there beyond translation?
Henry: Yes, an increasing number of leading AI, e-commerce, MT, and cross-border business companies are reaching out to us for cooperation. Also, we see big potential in the education/e-learning field. Sesame Lingo is one of our innovative products for language teaching and training, with the language data in the core database. Other applications include smart writing and pure data mining that might be applicable to many industries.
  8. What are some of the most interesting research applications of your data from the academic sector?
Henry: Corpus-based studies, and a lot of others.
  9. What are the most urgent data needs that you have by language, where there is not enough data?
Henry: Southeast Asian languages and South Asian languages.
  10. Are you trying to create new combinations of parallel data from existing data? For example, if there is English to Hindi and English to Chinese data in the same domain and subject, could you align the data to create Hindi <> Chinese data?
Henry: Yes, we mastered that technology years ago; hence an increasing number of language combinations and an increasing amount of data.
  11. What is your feeling about the general usefulness of this kind of data in the future?
Henry: With the development of data mining technologies, it will be applied to many more industries for sure. We are currently working very hard on in-context data and comparable data, which will be even more useful.


UTH, a Shanghai-based company, is a pioneer in the language service industry. UTH’s mission is to deliver innovative solutions that overcome challenges in language services with petabytes of translation data. Since its founding in 2012, UTH has accumulated more than 15 billion translation units across over 220 languages, including Arabic, Bulgarian, Chinese-Simplified, Chinese-Traditional, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Italian, Japanese, Korean, Polish, Portuguese, Romanian, Russian, Slovenian, Spanish, Swedish, Thai, Lao, and Khmer. This enables it to secure a strong foothold in China’s Belt and Road Initiative, covering a majority of the languages used in participating countries, and helps it win the support and cooperation of research institutions, language service buyers and providers, IT giants, e-commerce companies, and government agencies, as well as investment support from venture capital. Last year, it successfully completed a Series B investment from Sogou, the second largest search engine by mobile queries in China. Sogou completed its own IPO last year and posted $908.36 million of revenue in FY17.

UTH enhances its translation data business with diversification into handy tools for MT, language teaching and learning, and corpus research, which in turn sharpens its insights into the exploitation of big language data and artificial intelligence. Sesame Lingo is one of its products used for language teaching and training, with parallel corpora in the core database, and Sesame Search is an online corpus platform featuring multidimensional data classification, search, intuitive data presentation, and patented language processing technologies. Recently, UTH has completed several acquisitions to expand its business territory in e-learning, smart writing, data mining, and language services. With strong alliances with Sogou, five mid-sized LSPs, two AI companies, and more in 2018, UTH has already established an initial ecosystem and become the largest translation database in China. UTH has seen an increasing number of leading AI, e-commerce, MT, and cross-border business companies worldwide reaching out to it for potential collaboration opportunities.

UTH embarks on a pioneering road similar to TAUS, yet UTH possesses a uniquely different advantage. TDA from TAUS is based on a data-sharing mechanism, and the control of data quality is largely determined by data-owners’ integrity and their internal quality control process. When accumulating the language data, TAUS exploits a data-breeding technology, in which it cross-selects the translation units from different languages but with a common translation in the third language to form new pairs. At UTH, more than 50 in-house corpus linguists and engineers, supported by around 400 contracted linguists, are working meticulously in language data sourcing, collection, alignment, and annotation, overseen by the trained testers under rigorous internal quality rules. UTH has formulated a relatively complete set of language quality management practices with reference to LQA models and ISO standards, embedded in the in-house tools for higher efficiency.
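The data-breeding technique described above, cross-selecting translation units from different languages that share a common translation in a third, pivot language, can be sketched in a few lines. This toy version (function name and sample data are invented for illustration, not UTH's or TAUS's actual pipeline) joins only on exact pivot matches, whereas production systems normalize and fuzzy-match the pivot side:

```python
def breed_pairs(en_hi, en_zh):
    """Join English-Hindi and English-Chinese units on the shared English side
    to produce new Hindi-Chinese pairs."""
    zh_by_en = {en: zh for en, zh in en_zh}
    return [(hi, zh_by_en[en]) for en, hi in en_hi if en in zh_by_en]

# Toy translation memories sharing English as the pivot language
en_hi = [("good morning", "suprabhat"), ("thank you", "dhanyavad")]
en_zh = [("good morning", "zao shang hao"), ("goodbye", "zai jian")]

print(breed_pairs(en_hi, en_zh))
# → [('suprabhat', 'zao shang hao')]
```

The quality caveat in the paragraph above follows directly from this construction: any alignment error or sense ambiguity on the pivot side propagates into the bred pair, which is why human review of the output matters.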

UTH’s close cooperation with the academic sector imbues the company with a unique perspective on the potential of language data. Its data are classified into a unique three-tier pyramid (15 Level I domains, 41 Level II domains, and 178 Level III domains) for the purpose of mapping LSP requirements to academic disciplines in Chinese universities and making the data easily accessible to teachers and students on campus, which has won wide acclaim from education experts. In addition, the company launched its education cooperation initiatives in 2017, building several internship bases and joint research programs with prestigious universities in China and overseas, including Southeast University, the University of International Business and Economics, and Nanyang Technological University.

UTH’s focus on in-domain and in-context data is currently its priority and its major differentiator. As the largest repository of parallel texts in China, UTH is cooperating with LSPs (language service providers), freelance translators, language service buyers, and several big data organizations, orchestrating high-quality data flow among these organizations and turning immobile language data into flowing value. As a hub for data exchange, filtering, and processing, UTH has become an indispensable part of, and a booster in, this trade.

Nowadays, with the increasingly wide application of NMT in Twitter, Facebook, WeChat, QQ, and UGC platforms, as well as the industrial application of MT in interpretation, which helps people connect across language barriers, translation is growing into a crucial business energizer. However, the technology edge of forerunners such as Google is diminishing, narrowing the gap in translation quality among NMT vendors, including Bing, SYSTRAN, SDL, and DeepL, as well as Baidu, Sogou, NetEase, and iFlytek in China. NMT is a data-hungry application, in which data is fed into neural networks to improve their intelligence. Therefore, good-quality, fine-tuned translation data will become a crucial part of this fierce competition.

As a trailblazer in China, UTH is now feeding its translation data to several MT companies and MT labs, and together improving the final products, in the hope that it will do better than Google in Chinese-related language pairs in the very near future.

Saturday, March 31, 2018

The Future of Translation in a Gig Economy

 This is a guest post by Luigi Muzii based on a presentation he made recently on the coming industry disruption. Digital disruption is all around us today, and one characteristic of how it manifests is how it sneaks up and is in control before the incumbents even realize what has happened. Sears and other retailers barely saw Amazon coming, taxi companies did not see Uber coming, the hotel industry was caught unaware by Airbnb and so it goes.  

A recent Cisco/IDC study suggests that 40% of all companies today will be affected by, or even disappear because of, digital disruption.

 It is quite possible that a new player could emerge in the translation industry that produces a platform that allows buyers and sellers to congregate and conduct efficient transactions with minimal broker (LSP) support (or just platform support). The translation industry is one that still struggles to talk about quality in a way that is clear and meaningful to customers. A platform that provides CAT tools, properly integrated MT, and a straightforward means to discuss quality and the deliverable, could be a force of disintermediation. Several have tried and failed recently but the platform technology is getting better all the time. It could happen. Soon. There are too many people involved in performing repetitive, mundane tasks in a translation project, and some say it is ripe for change. 
  • Platforms ensure consistency, quality, and a good customer experience through the whole buyer journey
  • Platforms enable new people to enter the marketplace, both buyers and sellers, and often expand the traditional view of the marketplace
  • Platforms are an especially powerful means to create transformation around service offerings
  • TM and MT are not disruptive in themselves, but properly organized into intelligent AI workflow solutions they can indeed be used to deliver disruptive services
  • Mostly, platforms deliver new, no-hassle customer experiences to industries where people have gotten used to less-than-satisfactory CX.
 This article suggests that getting legal advice for "standard" legal work is a service that is ripe for disruption. If this is indeed true, then how long before "standard" translation work will also follow?

 Don't be surprised if soon you might be able to ask a chatbot for specific legal advice.



Preparing for disintermediation: Or what will the future look like in a global gig economy?

The following are a few basic questions about the gig economy using the classic “Five Ws (and One H)” rule of rhetoric:
  1. What is the gig economy?
  2. Who benefits from it?
  3. Where does it apply?
  4. When is it going to prevail?
  5. Why is the translation industry affected?
  6. How is disintermediation relevant?
The answers to these questions raise a few more, which will hopefully also receive tentative answers.
Let’s start with a brief recap first.

The translation industry

While the translation profession as we know it today was born between the two world wars, with the development of world trade, the so-called translation industry emerged between the late 1980s and the early 1990s, with the spread of personal computing.
In practice, with this burst of technology, a centuries-old solitary practice rapidly evolved, within a few decades, into shops and then into an industry.

Industry 4.0 & Translation

The same irruption of technology has led to two further industrial revolutions.
In fact, in 2011, the German government coined the term “Industry 4.0” to denote a “fourth industrial revolution” of smart machines capable of autonomously exchanging information, triggering actions, and controlling each other via the Internet, big data analytics, and AI.
Does the translation industry fit “4.0”? With some effort and a little imagination, the translation industry could be halfway between “2.0” and “3.0.”

The Gig economy

Let’s now address the six fundamental questions. The first one is, what is the gig economy?
The term gig was coined in the 1920s by jazz musicians to mean “engagement.” The concept of gig economy was introduced in 2009, when the effects of the financial crisis began to bite badly, to describe the economic activity of people using digital platforms for short-term engagements to make a living.

Where does the gig economy apply?

A gig economy typically develops after the disruption of markets following the establishment of technological platforms connecting businesses and independent professionals. In this respect, any market is exposed to the gig economy if its players can be digitally connected to customers regardless of their size and position.
The use of self-employed workers is not a peculiarity of post-crisis years. Businesses have been trying for decades to replace the traditional employment model to escape taxes and labor laws. Previously, intermediaries were used instead of digital platforms.

When does the gig economy prevail?

It already has, everywhere from consumption and leisure to services and manufacturing. Companies like Airbnb, Amazon, Foodora, Netflix, Uber, and Upwork have been disrupting their sectors, and apparently nothing can stop them, not even drivers’ and riders’ class actions or efforts to make them pay their dues to the communities they thrive on.

Why is the translation industry affected?

The business model is roughly the same as that of the gig economy. The parcellation of jobs, the endless quest for the lowest remuneration, the way jobs are dispatched, and the way people are hired and paid in the gig economy are nothing new in the translation industry.

So even the most celebrated gig-economy companies have little to teach their translation industry counterparts except, maybe, the tech element and a certain sophistication in tax evasion.

Who benefits from the gig economy?

The promises about the gig economy may sound appealing. Digital technologies let workers become entrepreneurs, free from the drudgery of traditional jobs while making extra cash in their free time.
In reality, workers in the gig economy are often manipulated into working long hours for low wages and continually chasing the next gig, while companies exploit the many loopholes in tax and labor laws.

The surge of the digital economy has led to a new feudalism, and those who own the platforms are the new lords.

How is disintermediation relevant?

The gig economy with its new landlords is reaching into all other industries, and localization is no exception.
Digital platforms are disrupting old-fashioned markets by parceling jobs out into discrete tasks and matching customers with workers, with pay determined by demand alone.
From the customer’s perspective, disintermediation is the answer to their quest for convenience and for cutting out the additional costs charged by intermediaries.
Parcellation of jobs has been happening in the localization industry for a few years now. The major difference from the companies of the platform economy is precisely the use of platforms.

The great decoupling

The wild side of the sharing economy and the gig economy is that convenience and affordability also come at a price, usually through eluding taxes and laws, thus eventually damaging society.

Also, the sharing economy has created a new monstrous type of customer who expects the service level of the Ritz Carlton at McDonald’s prices.

And what about the sharing economy’s promise of freedom and substantial extra income? It could not be farther from the truth. In fact, the growth of the sharing economy presents an economic paradox: productivity is rising while median income is flattening out.

Finally, the on-demand economy was supposed to unleash innovation. Can you see any real innovation coming? Or only a typical Schumpeterian “creative destruction”?

The future

The future is not what it used to be. With computers already performing 99% of translation jobs, a totally new approach must be devised to curb threats and seize the opportunities brought by innovation.

Some questions then arise that should be answered, however challenging: Will the translation industry survive? How long? What will the translation business look like in five years? Is a career in translation still advisable? What are the options and strengths to explore? What are the threats and weaknesses?

How long will the translation industry survive?

Some people claim that the demand for translation is growing and will keep growing in the coming years, but the measurement approach followed so far is questionable. As a matter of fact, growth in revenues may correspond to growth in volumes, but it may also hide stagnation, if not an outright decline, in prices and, possibly, in profits.
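The arithmetic behind this caveat is simple and can be sketched with illustrative numbers (the figures below are invented for the example, not industry data):

```python
# If revenues and volumes both grow, the implied unit-price change is the
# ratio of the two growth factors. Numbers here are purely illustrative.

def implied_price_factor(revenue_factor: float, volume_factor: float) -> float:
    """Unit-price factor implied by revenue and volume growth factors."""
    return revenue_factor / volume_factor

# Revenues up 10% while word volumes are up 30%:
factor = implied_price_factor(1.10, 1.30)
print(f"implied per-word price change: {factor - 1:+.1%}")  # ≈ -15.4%
```

In other words, a headline of "10% revenue growth" can coexist with a double-digit fall in per-word prices, which is exactly the stagnation the paragraph above warns about.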

Looking at production life cycle stages, translation revenues might already have peaked, while profits have possibly been decreasing for a few years now. This would explain the revival of the M&A frenzy: Organic growth is getting harder and harder, more and more investments are required to keep businesses profitable, and consolidation is the easiest way to grow and the most profitable exit strategy.

What will translation look like in five years?

In five years, the platform war will be over and a wealthy few will most probably rule the business world.

However young, the translation industry is fast approaching the end of a cycle and desperately needs renewal. Especially in recent years, industry players have been struggling to meet the demands of buyers who want ever-growing content volumes processed into more language pairs. Unfortunately, talent is not keeping pace with the abundance of tools, technology, and data: an increasingly varied bouquet of skills is required, while education initiatives lag dramatically behind. And while MT keeps proliferating, the talent shortage will only grow more serious.

In fact, the emphasis on language knowledge is still overstated at a time when expectations grow daily that the Internet giants will solve the pesky language problem once and for all, painlessly and possibly at almost no cost.

LSPs should then be utterly concerned about the sustainability of their business models. Scrambling for scale might not be enough even for the largest providers: Translation will still be here in five years, it will be here also in twenty years, but the translation industry may not.

Is a career in translation still advisable?

Translation education still looks less demanding, and thus faster, than scientific or technical education, and its lower return is perceived as the result of high costs rather than of low benefits. Yet, however friendly technology may look today, skills beyond languages are increasingly needed to cope with the growing complexity of the business world.

In this respect, with the almost total absence of real specialization in translation education, newcomers and even practitioners will need ever more intense continuous training to specialize and keep up with growing expectations.

Unfortunately, with LSPs struggling to stay profitable despite obsolete, inefficient, and costly processes while resisting customer pressure on prices, pay will keep falling, forcing the best resources out of the business. At the same time, the harshness of the gig economy will push more and more people with technical and scientific skills to look for additional income in translation. No specialization in medicine, biology, law, or engineering would make a translator any better at translation than a physician, biologist, attorney, or engineer with the same language pair and access to the same tools and resources.

What are the options and the strengths to explore?

Three areas should then be explored: technology, knowledge, and data. Machine translation is now a general-purpose technology and will be even more of a game changer than it has been so far. Indeed, MT is going to be so pervasive as to be embedded in practically every tool and application. Don’t forget that the washing machine changed the world more than the Internet has, and yet many would hardly be able to tell how and how much.

Knowledge will be as important as technology. Language is a technology too, but it is useless without the necessary ability to exploit it. Just like language, any other technology is no magic wand. Technology does not solve problems, people do with their practical intelligence. The same practical intelligence allows them to devise the processes that enable technology to maximize benefits and minimize risks.

Finally, the human brain is still the most powerful processing tool when it comes to reasoning. And knowledge allows people to pick the best data to have the machine make inferences and reliable predictions.

What are the threats and the weaknesses?

The major threat comes from the business model common to most translation business players. Not only is this model obsolete and largely wasteful, it is a major driver of disintermediation. And, in fact, industries that remain inefficient for too long are ideal candidates for disruption.

A major weakness comes from what is conversely often perceived and brandished as a weapon: information asymmetry. Only distrust and discontent come from the imbalance in transactions created by buyers’ inability to assess the value of a service before the sale.

Another significant weakness is the growing skill shortage. This is due to a lethal combination of ever-lower pay driving the best resources out and inadequate educational programs producing poorly skilled would-be translators.

Finally, the constant tide of new entrants and substitutes will further erode differentiation and minimize any network effect.

New entrants

The many affordable technologies and the very low financial, commercial, and legal barriers will result in new entrants being more and more often outsiders. But raising barriers is not the solution.

On the eve of disruption

Decreased transaction costs are expunging intermediaries from electronic value chains.

This means that even a buyer-seller matching platform for translation could be hard to develop, set up, and run profitably. A so-called marketplace is not enough. For real disintermediation, best-matching algorithms are required to shorten the traditional translation supply chain. However, project management can hardly be fully automated, especially for large and complex jobs involving several language pairs. The same goes for vendor management.

However, for small, single-pair jobs, more and more customers will search for translators through portals, willing to use them as virtual one-stop shops. These customers will also most probably expect to have their content translated nearly for free, if not entirely free; this, after all, is a typical sharing-economy effect.

Will you still be willing to fight for every customer and every job, even those going to the cheapest bidder? There will be more and more of them, even among the once-premium customers of the legendary premium segment.

So what?

If your competitors are getting stronger and you cannot outdo them, you might band together with them and possibly gain some advantage, rather than just giving up.

Side with evil

In other words, you can embrace the sharing economy and try to replicate the success of the companies in the gig economy.

The other dude

In this case, be ready to embrace Uber’s co-founder Travis Kalanick’s philosophy and get rid of the other dude, i.e. go for complete automation.


Be aware, too, that high attrition rates may not be a feasible long-term strategy. Unfortunately, yet unsurprisingly, as they get bigger, rather than investing more money and effort in the employee experience, companies usually become worse places to work. And in that case, things can get very bad if the tide of side-giggers withdraws.

Reputation is a unique asset that is hard to gain and much too easy to spoil, with both customers and vendors.

So, the next time you find yourself thinking about cutting costs to raise profits or protect your margins, remember that someone else might end up paying for your savings, and your reputation will eventually suffer.


There is no reason to fear disintermediation. Technology allows mindful players to develop and provide new service bouquets, but this requires a strong brand, the ability to differentiate from competitors, and deep diversification of services.

Digital transformation is no child’s play, though, and every business has its own intricacies. When seeking business opportunities in foreign markets, companies are confronted with functions in which they are not expert and must adapt fast. Most of these companies embraced automation and digital transformation much earlier, but they still have issues handling their digital content.

Focus on the "S" in LSP

There are many good opportunities for LSPs here, provided they can reshape their business models and start adding real value. In the age of Industry 4.0, companies are no longer willing to partner with old-fashioned organizations with virtually no real tech savvy.

Mindful LSPs may start by refining their service offering by including consulting services in their bouquets for companies that are trying to do business abroad.

Exploit technology

In this respect, the approach to technology should go well beyond CAT, TMS, and MT and extend to modern content processing technologies and techniques like machine learning, AI and natural language processing, to make content more useful to humans and computers.

Turn data into assets

To this end, LSPs should turn their data into assets and make the most of them. Machine learning algorithms are rapidly becoming a commodity, and the cost of even the most advanced of them will soon plummet. The value will not be in algorithms, then, but in data, which is indeed the oil of the digital era.


As a first step, start measuring. Through measurement, you’ll know more and reduce uncertainty, and thus risk. For correct measurement, though, you must know your data perfectly, master metrics, identify the key measurements and the right tools, and, above all, develop and continuously refine your measurement methods.

Once you have made your measurements and collected the results, convey this information to customers so that they can positively correlate it with your capabilities.


“Lunch atop a skyscraper” is a very famous picture, but few probably know its title, history, and, above all, who shot it. Even fewer would know who shot the “shooter”.

This is the kind of knowledge that might be considered specialized and yet it’s available to all, but you must have the practical intelligence to acquire it.

Today a computer system can play Go or drive a car, but still, no Go-playing computer can also drive a car. Machines may perform specific tasks, but they lack understanding of the world—sentience—and cannot transfer knowledge laterally between domains.

Translation will be more and more an engineering thing, but machines will remain dependent on humans for building their “knowledge” from training data for the foreseeable future.

Embrace the future

The future has already begun, tempus fugit, time is running out and it is always less than you expected. So, if the question is when, the answer is now, hic et nunc, before it’s too late.


Luigi Muzii's profile photo

Luigi Muzii has been in the "translation business" since 1982 and has been a business consultant since 2002, in the translation and localization industry through his firm. He focuses on helping customers choose and implement best-suited technologies and redesign their business processes for the greatest effectiveness of translation and localization-related work.

This link provides access to his other blog posts.

Thursday, March 29, 2018

Tmxmall - Translation Memory Marketplace Overview

This is a guest post by the principals of Tmxmall, a Chinese translation memory marketplace, i.e. a site that sells translation memory data to interested parties (mostly translators and LSPs). This is one of two such initiatives coming from China; I will be featuring the other one shortly.

Data is a critical requirement for any of the modern MT technology paradigms. While SMT could tolerate some level of noise in its training data, NMT appears to be more particular, and the "more data is better" principle does not apply as clearly, if it ever did. The quality of the training data matters with AI and machine learning. The pursuit of building up training data resources will always exist in the context of any machine learning technology. However, commodity data that you can easily obtain or buy tends to be low-quality data, not high-quality data by most people's assessment.

This is what TAUS has to say about these new initiatives:
"The internet giants had a competitive edge in translation data, but they spoiled it by polluting their own fishing grounds with machine translations. Now, the hunt is open for new data marketplaces. The European Commission is investing in the Connecting European Facility. But watch out also for the greenfield translation data ventures in China, or perhaps closer to home: the TAUS Data Cloud."
To the best of my knowledge, data-sharing initiatives were not particularly successful with SMT. There has always been a problem of uneven quality when disparate data is pooled together, and I am not sure these new TM marketplaces change that. I believe rich metatags providing some meaningful, consistent, and objective indication of quality are likely needed to make such data exchanges viable. I recall from my experiments with TDA data that it was often wise to completely exclude certain lower-quality datasets and sources, but that understanding came only after much trial and error.
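To make the metatag idea concrete, here is a minimal Python sketch of filtering pooled TM segments by a per-source quality tag. The provider names, scores, and threshold are all invented for the illustration:

```python
# A minimal sketch: pooled TM segments carry a quality tag in their metadata,
# and only segments above a threshold are kept for MT training. All names,
# scores, and the threshold are hypothetical.

segments = [
    {"src": "Press the power button.", "tgt": "Drücken Sie die Ein/Aus-Taste.",
     "provider": "lsp_a", "quality": 0.92},
    {"src": "click here", "tgt": "hier klicken hier klicken",
     "provider": "crawl_x", "quality": 0.31},
    {"src": "Save your changes.", "tgt": "Speichern Sie Ihre Änderungen.",
     "provider": "lsp_b", "quality": 0.88},
]

QUALITY_THRESHOLD = 0.75  # arbitrary cut-off for this example

def usable(seg: dict) -> bool:
    """Keep only segments whose metadata marks them as high quality."""
    return seg["quality"] >= QUALITY_THRESHOLD

train_set = [s for s in segments if usable(s)]
print(len(train_set), "of", len(segments), "segments kept")  # 2 of 3
```

The hard part, of course, is not the filter but agreeing on a consistent, objective way to assign the quality score in the first place.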

I find the vision of the cloud-based online CAT much more compelling than desktop solutions, and I would not be surprised if these collaborative, big-data-based multi-tool work environments do indeed become increasingly compelling, even to power users of yesterday's technology.

So here are some of my favorite quotes about data.

"Data-intensive projects have a single point of failure: data quality." - George Krasadakis, Data Quality in the Era of AI

"By far, the greatest danger of Artificial Intelligence is that people conclude too early that they understand it." - Eliezer Yudkowsky

"We don't have better algorithms. We just have more data." - Peter Norvig

“The sad thing about artificial intelligence is that it lacks artifice and therefore intelligence.”  - Jean Baudrillard

“We’re entering a new world in which data may be more important than software.”- Tim O’Reilly, Founder, O’Reilly Media

“Data is a precious thing and will last longer than the systems themselves.”- Tim Berners-Lee, father of the Worldwide Web

Despite the hype, we should understand that deep learning algorithms are increasingly going to be viewed as commodities. It's the data where the real value is. K.V.😉

 For those of you interested in eDiscovery and new applications for MT in Information Governance please check out this webinar that I just did together with Nuance.


Tmxmall, headquartered in Shanghai, China, is one of the leading providers specializing in language asset management and the promotion of global TM sharing. We are a team of technology and language geeks who have built products around translation memories, helping translators and MT providers make better use of translation memory data.

Our Story

In 2014, Jing Zhang, the founder and CEO of Tmxmall, who earned his bachelor's degree in Computer Science from Northwestern Polytechnical University and his master's degree in Information Management from Tianjin University, left Baidu and started his own business with his classmate Jian Chen, who had worked for Huawei and Baidu and is now the CTO of Tmxmall.

Both Jing and Jian are fascinated by information retrieval and search engines. Motivated by this interest, they applied their expertise to the retrieval and leverage of translation memory, and they are working on mining more valuable information from TM data and promoting data sharing and trading.

Jing Zhang (left) and Jian Chen(right)

The Status

Tmxmall pays close attention to data and works hard to capture value from the language data we hold. At present, we have nearly 7 billion sentence pairs, classified into 34 language pairs covering English, Japanese, Russian, German and other languages, and over 10 domains such as economics, bioscience, law, and medicine. The data is mainly Chinese/English into other languages and comes from offline exchanges, purchases from LSPs and freelancers, web crawling, and bilingual document alignment. Among it, human-translated zh-en/en-zh domain data and zh/en - Southeast Asian language data are the most popular in China. We have developed services and products that all revolve around the translation memory ecosystem, helping users fully leverage our data and manage their own language assets. Recently, Tmxmall localized its official website into English so that users across the world can benefit from our services.

Our Products

TMXMall Roadmap

Tmxmall Aligner

Tmxmall Aligner is an online tool for creating translation memories by aligning parallel texts. It supports 17 file formats and 19 languages, in both bilingual-document and single-document modes. It processes parallel texts with Tmxmall's self-developed alignment algorithm, which works at paragraph and sentence level and can automatically recognize three pairing situations: one source to multiple targets, multiple sources to one target, and multiple sources to multiple targets. It also supports de-duplication, filtering, and find & replace, making the creation of translation memories more convenient and efficient.
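Tmxmall's alignment algorithm is proprietary, but the general idea of pairing sentences across languages can be sketched with a toy length-ratio heuristic in the spirit of classic Gale-Church-style alignment. This is only an illustration of the concept, not Tmxmall's method:

```python
# Toy length-based alignment: sentences that are translations of each other
# tend to have similar lengths, so a length ratio is a cheap first signal.
# Real aligners combine this with lexical evidence and dynamic programming.

def length_ratio_score(src: str, tgt: str) -> float:
    """Score a candidate pair by character-length similarity (1.0 = equal length)."""
    a, b = len(src), len(tgt)
    return min(a, b) / max(a, b) if max(a, b) else 1.0

def pick_best_target(src: str, candidates: list) -> str:
    """Choose the candidate target sentence whose length best matches the source."""
    return max(candidates, key=lambda t: length_ratio_score(src, t))

best = pick_best_target(
    "The installation takes about ten minutes.",
    ["Ja.", "Die Installation dauert etwa zehn Minuten.", "Fehler."],
)
print(best)  # the German sentence of comparable length wins
```

A production aligner would refine this with sentence splitting, anchor words, and the one-to-many pairing cases described above.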

TM exchange platform

The Tmxmall TM exchange platform lets users retrieve translation units in exchange for the translation units they upload. Users can also upload their own translation memories for others to retrieve, download, and purchase.

TM SaaS management System

The Tmxmall TM SaaS management system is designed for users to manage their TMs: uploading, sharing, retrieving, and deleting them, and conducting collaborative translation by referring to or updating TMs in real time. Users, whether freelance translators or LSPs, can rent the system according to the capacity they use.

TM Marketplace

The TM marketplace is the place for TM sharing and trading, supporting TM files in 19 languages. Users can upload their own TMs, search for matches, and sell or purchase segments matched against the data stored on the platform. Because it is connected with the TM SaaS management system, every TM bought on the marketplace can be managed there. At present, data sold on the TM marketplace costs, per 1,000 words: $1.50 for a 100% match, $1.24 for a 95-99% match, $0.78 for an 85-94% match, and $0.45 for a 75-84% match. The money goes to the data owner when a TM transaction is completed.
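The published rate card can be encoded as a simple band lookup. The function below follows the rates quoted above; the assumption that anything below a 75% match has no listed price is mine, for illustration:

```python
# The marketplace's published per-1000-word rates, as a banded lookup.
# Band floors follow the post; sub-75% matches are treated as unpriced here.

RATES_PER_1000_WORDS = [  # (minimum match %, USD per 1000 words)
    (100, 1.50),
    (95, 1.24),
    (85, 0.78),
    (75, 0.45),
]

def tm_price(words: int, match_pct: int) -> float:
    """Price of a TM purchase given word count and fuzzy-match percentage."""
    for floor, rate in RATES_PER_1000_WORDS:
        if match_pct >= floor:
            return round(words / 1000 * rate, 2)
    return 0.0  # below 75%: no listed price

print(tm_price(2000, 100))  # 3.0
print(tm_price(500, 90))    # 0.39
```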


TM ROBOT is client software for managing and sharing local TM data, built on top of the TM Marketplace for users who hesitate to upload their TMs online. It is also designed to promote a knowledge-sharing economy by connecting global TM data, helping users earn lasting yields from sharing TMs while their translation work is respected, and making language assets reusable to raise production efficiency in the translation industry. Once TM ROBOT is installed, users can manage and share TMs, and search for TM matches on the TM marketplace, from their own computers.

TM ROBOT Working Module
There are three core modules in TM ROBOT's working environment: the client terminals (TM ROBOTs) willing to share TMs, the Tmxmall TM marketplace, and the CAT tools integrated with the marketplace's Open API. A client terminal chooses the TMs it is willing to share and submits random sentence pairs from them to a P2P platform. If the quality of the submitted sample is approved, the source TMs on that terminal are included in the P2P platform and can be retrieved by all users on it. When users translate in a CAT tool integrated with the Open API, their source sentences are sent to the TM marketplace, which distributes them to all online client terminals that have shared TMs. Each terminal that receives the query searches its local shared TMs and returns the matched results to Tmxmall, which consolidates all the results and returns the best one to the CAT user. When the optimized result is returned, Tmxmall deducts the relevant fees from the CAT user's Tmxmall account.
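The fan-out flow described above can be sketched as a toy simulation: the marketplace queries every online terminal, each returns its best local hit, and the best overall result goes back to the CAT user. The scoring below uses Python's difflib as a rough stand-in for Tmxmall's actual matching, and the terminal names and TM contents are invented:

```python
# Toy simulation of the TM ROBOT fan-out: broadcast a query to all online
# terminals, collect each terminal's best local match, return the overall best.
import difflib

terminals = {  # terminal id -> its shared local TM (source -> target)
    "robot_1": {"Open the file menu.": "打开文件菜单。"},
    "robot_2": {"Open the file.": "打开文件。", "Save the file.": "保存文件。"},
}

def best_match(query: str):
    """Fan the query out to all terminals and keep the highest-scoring hit."""
    hits = []
    for term_id, tm in terminals.items():
        for src, tgt in tm.items():
            score = difflib.SequenceMatcher(None, query, src).ratio()
            hits.append((score, term_id, src, tgt))
    return max(hits)  # (score, terminal, source, target)

score, term_id, src, tgt = best_match("Open the file menu.")
print(term_id, tgt)  # robot_1 holds the exact match
```

The real system adds the quality-sampling gate, billing, and an API layer, but the routing logic is essentially this broadcast-and-consolidate pattern.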

Tmxmall API

The Tmxmall API is a plug-in that integrates all the data stored on the Tmxmall platform (the TM exchange platform, TM SaaS system, TM marketplace, and TM ROBOT) into desktop CAT tools, including SDL Trados and MemoQ, and online CAT tools such as the Tmxmall online CAT. With the Tmxmall API, language data on the Tmxmall platform can be searched while translating in a CAT tool.

Online CAT

The online CAT was developed for translators and small teams handling small translation projects. It seamlessly connects to all language data stored on the Tmxmall platform and supports Google Translate and pre-translation. We are now working on a new version, to be released in the coming year, that will support large translation projects and the translation workflows of freelance translators and LSPs. It will support a variety of input formats, real-time supervision, machine translation, QA checks, and simultaneous translation and reviewing. In particular, it will be integrated with the large TM data on Tmxmall's TM exchange platform, TM SaaS management system, TM marketplace, and local TM ROBOT.

Tmxmall Online CAT Features

Worried about data quality? So are we.

Our primary users are freelance translators, LSPs, teachers in university translation and interpretation programs, and MT providers. As a responsible enterprise, we put our users first, so we definitely care about data quality. Every TM uploaded to our platform is verified by Tmxmall staff, and only human-translated TMs that are properly aligned are approved and published. Besides manual verification, we are developing a QA tool with built-in metrics to spot errors such as punctuation, number, and omission mistakes. Users who want to purchase data can view a random sample of 30 sentence pairs to get an overview of the quality.
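The built-in QA metrics are not public, but the kinds of mechanical checks mentioned (numbers, omissions) can be sketched in a few lines. This is only an illustration of the idea, not Tmxmall's tool:

```python
# Sketch of mechanical TM QA checks: compare the numbers on both sides of a
# segment pair and flag empty targets as omissions.
import re

def qa_flags(src: str, tgt: str) -> list:
    """Return a list of mechanical QA warnings for a segment pair."""
    flags = []
    if re.findall(r"\d+", src) != re.findall(r"\d+", tgt):
        flags.append("number mismatch")
    if not tgt.strip():
        flags.append("empty target (omission)")
    return flags

print(qa_flags("Heat for 5 minutes.", "Erhitzen Sie 15 Minuten lang."))
# ['number mismatch']
```

Checks like these catch only surface errors; judging whether a pair is actually human-translated still requires human review or trained models.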

Research on TMs

Since Tmxmall was established, we have never stopped our research on TMs. Over years of study, we have achieved several successes:
  • With our automatic alignment algorithm, which draws on machine translation, bilingual documents can be aligned automatically with a 95% accuracy rate.
  • With our natural language processing technology and thousands of high-quality TMs, lower-quality sentence pairs are automatically spotted by a TM assessment algorithm.
  • By leveraging CNN classification technology, large-volume TMs can now be automatically classified with up to 97% accuracy.
  • The response time for retrieval over billions of sentence pairs has been reduced to 200 ms through optimization of our distributed search engines.
  • The Tmxmall Machine Translation Plug-in is now available in SDL Trados. It supports machine translation tools including Google Translate, Baidu Translate, Sogou Translate, Youdao Translate, and Newtranx, allowing users to produce more translated material without increasing costs.


Our Ambition

Recently, we have seen increasing interest in MT data, especially from MT developers training their engines, which suggests that the research and implementation of machine translation will boom in the near future. Since 2014, we have accumulated a large volume of language data, which allows us to dream big and step toward the AI machine translation industry. With our self-developed algorithms, data-mining technology, and language data, we can train domain MT engines on specific language data to produce accurate, high-quality machine translation. As we move from language data research to building MT engines, Tmxmall treats data as the guide for every aspect of our business and strongly believes this is the best long-term way for us to grow and thrive.