Read a UTF-8 file into a UCS-4 string

Question

I am trying to read a UTF-8 encoded file into a UTF-32 (UCS-4) string. Basically, I want a fixed-size character type inside the application.

Here I want to make sure the translation is done as part of the stream processing (because that is what a locale is supposed to be used for). Other questions have been posted about doing the translation on a string, but that is wasteful: you have to do the translation phase in memory and then make a second pass to send the result to a stream. By doing it with a locale in the stream you only make one pass, and there is no need to make a copy (assuming you want to keep the original).

This is what I tried:

#include <iostream>
#include <fstream>
#include <locale>
#include <codecvt>

int main()
{
    std::locale     converter(std::locale(), new std::codecvt_utf8<char32_t>);
    std::basic_ifstream<char32_t>   iFile;
    iFile.imbue(converter);
    iFile.open("test.data");

    std::u32string     line;
    while(std::getline(iFile, line))
    {
    }
}

Since these are all standard types, I was surprised by this compile error:

/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/istream:275:41:
error: no matching function for call to 'use_facet'

const ctype<_CharT>& __ct = use_facet<ctype<_CharT> >(__is.getloc());
^~~~~~~~~~~~~~~~~~~~~~~~~

Compiled with:

g++ -std=c++14 test.cpp


Tags: c++, ucs-4, utf-8

Accepted Answer

It seems char32_t was not what I wanted. Simply switching to wchar_t worked for me. I suspect this only works the way I want on Linux-like systems, where wchar_t is 32 bits wide; on Windows, where wchar_t is 16 bits, the conversion would be to UTF-16 (UCS-2) instead (but I can't test that).
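As a side note on the wchar_t width caveat: where wchar_t is only 16 bits wide, one possible fallback is to decode the bytes by hand into a std::u32string. The following is a minimal sketch and not part of the original answer; decode_utf8 is a hypothetical helper name, and it deliberately skips the checks for overlong encodings and surrogate code points that a strict decoder would need.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (assumption, not from the original answer):
// decode a UTF-8 byte string into UTF-32 code points.
// Throws std::runtime_error on a bad lead byte, a bad continuation byte,
// or a truncated sequence. Overlong forms and surrogates are NOT rejected.
std::u32string decode_utf8(const std::string& bytes)
{
    std::u32string out;
    for (std::size_t i = 0; i < bytes.size(); ) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t    cp  = 0;
        std::size_t len = 0;

        // The lead byte tells us how long the sequence is.
        if      (lead < 0x80)           { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (i + len > bytes.size())
            throw std::runtime_error("truncated UTF-8 sequence");

        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char cont = static_cast<unsigned char>(bytes[i + k]);
            if ((cont & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (cont & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

Each line read with std::getline from a plain std::ifstream could be passed through decode_utf8. Note that this is the convert-after-reading approach, not the in-stream conversion the question asks for.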

#include <algorithm>
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

int main()
{
    std::locale           utf8_to_utf32(std::locale(), new std::codecvt_utf8<wchar_t>);

    // Input stream reads UTF-8 and converts to a UTF-32 (UCS-4) string
    std::wifstream        iFile("test.data");
    iFile.imbue(utf8_to_utf32);

    // Output stream converts the UTF-32 (UCS-4) string back to UTF-8
    std::wofstream        oFile("test.res");
    oFile.imbue(utf8_to_utf32);

    // Now just read like you would normally.
    std::wstring     line;
    while(std::getline(iFile, line))
    {
        // UTF-32 characters are fixed size,
        // so reversing is simple: just do it in place.
        std::reverse(std::begin(line), std::end(line));

        // Unfortunately, UTF-32 text can still contain grapheme clusters (groups of
        // code points that are displayed as a single glyph). The reverse above splits
        // these incorrectly. A second pass is needed to reverse the code points inside
        // each cluster. This is beyond the scope of this question and left as an exercise
        // (but I may come back to it later).
        oFile << line << "\n";
    }
}
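The grapheme-cluster problem mentioned in the comments can be illustrated with a small sketch. It is not the author's solution and it is far from complete: it assumes that only the Combining Diacritical Marks block (U+0300 to U+036F) extends a cluster, whereas real segmentation follows the Unicode text segmentation rules. The names is_combining_mark and reverse_keeping_marks are hypothetical.

```cpp
#include <algorithm>
#include <string>

// Assumption for illustration only: a code point extends the previous
// cluster if and only if it lies in U+0300..U+036F.
static bool is_combining_mark(char32_t c) { return c >= 0x0300 && c <= 0x036F; }

// Reverse a string of code points while keeping each base character
// together with the combining marks that follow it.
template <class String>
void reverse_keeping_marks(String& s)
{
    using Char = typename String::value_type;

    // Pass 1: reverse everything. A cluster "base m1 m2" becomes "m2 m1 base".
    std::reverse(s.begin(), s.end());

    // Pass 2: each run of marks ending in a base character is one reversed
    // cluster, so reverse it again to restore its internal order.
    for (auto p = s.begin(), end = s.end(); p != end; ) {
        auto q = p;
        p = std::find_if(p, end, [](Char c) {
            return !is_combining_mark(static_cast<char32_t>(c));
        });
        if (p != end) ++p;          // include the base character in the run
        std::reverse(q, p);
    }
}
```

In the loop above, calling reverse_keeping_marks(line) in place of std::reverse would keep sequences such as "e" followed by U+0301 intact under the stated assumption.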

A comment above suggests that this will be slower than reading the data and then translating it in a string. So I ran some tests:

// read1.cpp Translation in the stream using codecvt and a locale

#include <algorithm>
#include <iostream>
#include <fstream>
#include <locale>
#include <codecvt>
#include <string>

int main()
{
    std::locale           utf8_to_utf32(std::locale(), new std::codecvt_utf8<wchar_t>);

    std::wifstream        iFile("test.data");
    iFile.imbue(utf8_to_utf32);

    std::wofstream        oFile("test.res1");
    oFile.imbue(utf8_to_utf32);

    std::wstring     line;
    while(std::getline(iFile, line))
    {
        std::reverse(std::begin(line), std::end(line));
        oFile << line << "\n";
    }
}

// read2.cpp Translation using codecvt after reading

#include <algorithm>
#include <iostream>
#include <fstream>
#include <locale>
#include <codecvt>
#include <string>

int main()
{
    std::ifstream        iFile("test.data");
    std::ofstream        oFile("test.res2");

    std::wstring_convert<std::codecvt_utf8<wchar_t>> utf8_to_utf32;

    std::string     line;
    std::wstring    wideline;
    while(std::getline(iFile, line))
    {
        wideline = utf8_to_utf32.from_bytes(line);
        std::reverse(std::begin(wideline), std::end(wideline));
        oFile << utf8_to_utf32.to_bytes(wideline) << "\n";
    }
}

// read3.cpp Working directly on UTF-8

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <string>
#include <fstream>

// True for ASCII bytes and for the first byte of a multi-byte sequence.
static bool is_lead(std::uint8_t ch) { return ch < 0x80 || ch >= 0xc0; }

/* Reverse a utf-8 string in-place */
void reverse_utf8(std::string& s) {
    // Reverse all bytes, then re-reverse each sequence so its bytes are in order again.
    std::reverse(s.begin(), s.end());
    for (auto p = s.begin(), end = s.end(); p != end; ) {
        auto q = p;
        p = std::find_if(p, end, is_lead);
        if (p != end) ++p;   // guard against malformed input with no lead byte
        std::reverse(q, p);
    }
}

int main()
{
    std::ifstream        iFile("test.data");
    std::ofstream        oFile("test.res3");

    std::string     line;
    while(std::getline(iFile, line))
    {
        reverse_utf8(line);
        oFile << line << "\n";
    }
    return 0;
}
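To see why the two-pass reversal in read3.cpp works, it may help to trace it on a short input. The sketch below is a standalone copy of reverse_utf8 (with a bounds guard added for illustration) and a byte-level walkthrough in the comments; the sample string is an assumption chosen for the example.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>

// True for ASCII bytes and for the first byte of a multi-byte sequence.
static bool is_lead(std::uint8_t ch) { return ch < 0x80 || ch >= 0xc0; }

// Walkthrough for the input "a" followed by U+65E5 (bytes 61 E6 97 A5):
//   after the full reverse:      A5 97 E6 61
//   first run (up to the lead):  A5 97 E6  -> reversed back to E6 97 A5
//   second run:                  61        -> unchanged
//   result:                      E6 97 A5 61, i.e. U+65E5 followed by "a"
void reverse_utf8(std::string& s)
{
    std::reverse(s.begin(), s.end());
    for (auto p = s.begin(), end = s.end(); p != end; ) {
        auto q = p;
        p = std::find_if(p, end, is_lead);
        if (p != end) ++p;   // guard: malformed input may lack a lead byte
        std::reverse(q, p);
    }
}
```

Because continuation bytes always have the form 10xxxxxx, the lead byte is enough to find sequence boundaries, so no decoding to code points is needed.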

The test file was 58 MB of Japanese Unicode text.

> ls -lah test.data
-rw-r--r--  1 loki  staff    58M Jan 28 11:28 test.data

> g++ -O3 -std=c++14 read1.cpp -o a1
> g++ -O3 -std=c++14 read2.cpp -o a2
> g++ -O3 -std=c++14 read3.cpp -o a3
>
> # This is the one using Locale in stream
> time ./a1

real    0m0.645s
user    0m0.521s
sys     0m0.108s
>
> # This is the one doing translation after reading.
> time ./a2

real    0m1.058s
user    0m0.916s
sys     0m0.123s
>
> # This is the one using UTF-8
> time ./a3

real    0m0.785s
user    0m0.663s
sys     0m0.104s

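Doing the translation in the stream is faster, but not dramatically so (and that was not a lot of data): 0.645 s of wall-clock time, versus 1.058 s for translating after reading and 0.785 s for working directly on UTF-8. So choose whichever is easiest to read.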
