
Write a very basic string extractor
ClosedPublic

Authored by epriestley on Feb 4 2014, 12:33 AM.
Tags
None

Details

Reviewers
btrahan
Maniphest Tasks
T1139: Internationalize Phabricator
Commits
Restricted Diffusion Commit
rP0726411cb490: Write a very basic string extractor
Summary

Ref T1139. This has some issues and glitches, but is a reasonable initial attempt that gets some of the big pieces in. We have about 5,200 strings in Phabricator.

Test Plan

{F108261}

Diff Detail

Repository
rP Phabricator
Branch
i18n1
Lint
Lint Passed
Unit
Tests Passed

Event Timeline

Is there a deeper reason why you don't just use gettext, which already includes such a parser?

See also my German Phabricator branch:

https://github.com/schlaile/phabricator
https://github.com/schlaile/libphutil

Specifically, the following simple shell script updates the string database:
https://github.com/schlaile/phabricator/blob/master/scripts/internationalization/build_message_files.sh

Proper plurals are also possible with gettext, though I haven't implemented them yet.

gettext also has the nifty feature that you can use tools like poedit to edit the translated strings.

If you have any further questions, feel free to ask.

gettext doesn't support genders or strings with multiple plurals, does it? For example, I think it can not translate these correctly into all languages:

%s liked this post.
There are %d apple(s) and %d banana(s).

In the former case, some languages require separate translations if the user is male or female. As far as I know, gettext does not provide support for this. Conceivably we could fake it by using plural translations, but that would get messy quickly.

In the latter case, there are two plurals. As far as I know, gettext does not support this either.
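
To make the two cases concrete, here is a minimal sketch of a lookup that chooses a variant per argument, so one string can depend on a gender and on more than one count. Everything in it (`Person`, `translate`, the nested-list catalog layout, the Polish sample strings) is a hypothetical illustration, not Phabricator's actual translation format.

```python
from dataclasses import dataclass


@dataclass
class Person:
    """Hypothetical stand-in for a user whose grammatical gender is known."""
    name: str
    gender: str  # "m" or "f"

    def __str__(self):
        return self.name


GENDER_INDEX = {"m": 0, "f": 1}


def english_plural_index(n):
    # English has two forms: singular for exactly 1, plural otherwise.
    return 0 if n == 1 else 1


def translate(catalog, msgid, *args, plural_index=english_plural_index):
    """Walk a nested list of variants, descending one level per argument."""
    entry = catalog.get(msgid, msgid)
    for arg in args:
        if not isinstance(entry, list):
            break  # plain string: no more variants to choose between
        if isinstance(arg, Person):
            entry = entry[GENDER_INDEX[arg.gender]]
        elif isinstance(arg, int):
            entry = entry[plural_index(arg)]
    return entry % args


# Two counts, so two levels of nesting (assumed layout).
EN_CATALOG = {
    "There are %d apple(s) and %d banana(s).": [
        ["There is %d apple and %d banana.",
         "There is %d apple and %d bananas."],
        ["There are %d apples and %d banana.",
         "There are %d apples and %d bananas."],
    ],
}

# One gendered argument: the Polish past tense differs by gender.
PL_CATALOG = {
    "%s liked this post.": ["%s polubił ten post.", "%s polubiła ten post."],
}

print(translate(EN_CATALOG, "There are %d apple(s) and %d banana(s).", 1, 2))
print(translate(PL_CATALOG, "%s liked this post.", Person("Anna", "f")))
```

The point of the sketch is only that variant selection happens once per argument rather than once per string; a language with more plural forms would supply a different `plural_index` and longer lists.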

Generally, we believe we have a good understanding of the problem domain and a simple, straightforward path to solving it without making any compromises. gettext would add a lot of complexity, and I believe it has a lot of limitations we'd have to work around. We wouldn't get much in exchange.

If translators want to use tools like poedit, we could write a script that converts our internal format into .po, and then compiles .po back into our internal format. That would probably take a lot less time to write than switching everything over to gettext.
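
As a rough idea of what the export half of such a script could look like, here is a minimal sketch. It assumes a hypothetical internal format of `{msgid: msgstr}` where a value may also be a flat list of plural forms; `po_quote` and `to_po` are illustrative names, and the header entry that a real `.po` file carries (charset, `Plural-Forms`) is left out.

```python
def po_quote(text):
    """Escape a string for use inside a double-quoted .po literal."""
    escaped = (
        text.replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n")
    )
    return f'"{escaped}"'


def to_po(catalog):
    """Render {msgid: msgstr or [form0, form1, ...]} as .po text."""
    blocks = []
    for msgid, value in sorted(catalog.items()):
        lines = [f"msgid {po_quote(msgid)}"]
        if isinstance(value, list):
            # .po wants a separate msgid_plural; the assumed internal format
            # has a single source string, so it is reused here.
            lines.append(f"msgid_plural {po_quote(msgid)}")
            for i, form in enumerate(value):
                lines.append(f"msgstr[{i}] {po_quote(form)}")
        else:
            lines.append(f"msgstr {po_quote(value)}")
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"


print(to_po({
    "Save": "Speichern",
    "%d file(s)": ["%d Datei", "%d Dateien"],
}))
```

Only flat plural lists map onto `msgstr[N]` this way; an entry that varies on more than one argument would need some extra encoding on top of the `.po` format.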

> gettext doesn't support genders

http://stackoverflow.com/questions/6143547/handling-grammatical-gender-with-gettext

Using contexts looks like a nice solution.
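
For illustration, the context approach boils down to keying each translation on a (context, msgid) pair, which is what gettext's `msgctxt` provides. The sketch below imitates that with a plain dictionary instead of a real gettext catalog; the `pgettext` helper here, the context names, and the Polish strings are assumptions made for the example.

```python
# Hypothetical catalog keyed by (context, msgid), imitating gettext's msgctxt.
CATALOG = {
    ("male", "%s liked this post."): "%s polubił ten post.",
    ("female", "%s liked this post."): "%s polubiła ten post.",
}


def pgettext(context, msgid):
    """Look up msgid under a context; fall back to the untranslated msgid."""
    return CATALOG.get((context, msgid), msgid)


def liked(name, gender):
    # The call site, not the catalog, decides that gender matters here.
    return pgettext(gender, "%s liked this post.") % name


print(liked("Anna", "female"))
print(liked("Sam", "unknown"))  # no matching context: falls back to English
```

One consequence of this design is that every gendered string has to be marked with a context in the source code, whether or not the target language needs the distinction.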

> or strings with multiple plurals, does it?

Okay, I didn't know you wanted to go the extra mile of doing that properly.

> For example, I think it can not translate these correctly into all languages:
>
>   %s liked this post.
>   There are %d apple(s) and %d banana(s).
>
> In the former case, some languages require separate translations if the user is male or female. As far as I know, `gettext` does not provide support for this. Conceivably we could fake it by using plural translations, but that would get messy quickly.
>
> In the latter case, there are two plurals. As far as I know, `gettext` does not support this either.
>
> Generally, we believe we have a good understanding of the problem domain and a simple, straightforward path to solving it without making any compromises. `gettext` would add a lot of complexity, and I believe it has a lot of limitations we'd have to work around. We wouldn't get much in exchange.

Well:

  • You'd get a very fast lookup engine for translation strings (stored in binary files). If you want to do this seriously, you'll have to move away from plain PHP files.
  • You'd get a pretty good string extractor (which you otherwise have to write yourself, and which isn't finished yet...).
  • You'd get a very good plural engine. For the Polish horror, see:

http://stackoverflow.com/questions/12121515/multiple-plural-forms-in-gettext

But you are right, multiple plurals won't work very nicely.

  • You'd get a lot of pretty nice tools for handling the translations (poedit etc.).

> If translators want to use tools like `poedit`, we could write a script that converts our internal format into `.po`,

Keep in mind that you'd have to add some extensions to the .po message format to actually handle multiple plurals. AFAIK the message format is currently limited to handling only one plural. Adding proper support to gettext and pushing that upstream might also be worth doing (it would help a lot of OSS projects on the planet).

> and then compiles .po back into our internal format. That would probably take a lot less time to write than switching everything over to gettext.

Well, gettext integration with your current engine wasn't that hard (it was a matter of adding a TranslatorClass and some small changes to libphutil).

But if you want to do it the right way(tm), even better.

We absolutely agree that choosing or writing the translation engine isn't the largest part of the problem; actually writing and maintaining the message catalogs is.

So feel free to use my German message catalog when the time is right. I have somewhere between 30% and 40% of the strings translated right now.

Have a good time and thanks for writing Phabricator in the first place! (I'm actually managing a music store with it, in a total abuse of its original intent. It works even with a non-technical audience, so I think you got a lot of things pretty much right :) )