phpoffice — Как извлечь текстовое содержимое из текстового документа с помощью PHP?

Я хочу извлечь текстовое содержимое из документа Word с помощью PHP.

Я создал новый документ Word в Microsoft Word для Mac 2011.
Редактировать: также протестировали, создав тот же документ в Microsoft Word под Windows 7.

Содержание документа

The quick brown fox jumps over the lazy dog

Я сохранил его на диск как документ Word 97-2004 (.doc).

я использую phpoffice / phpword и этот код для извлечения текста:

<?php

$source = "word.doc";

$phpWord = \PhpOffice\PhpWord\IOFactory::load($source, 'MsDoc');

$text = '';

$sections = $phpWord->getSections();

foreach ($sections as $s) {
$els = $s->getElements();
foreach ($els as $e) {
if (get_class($e) === 'PhpOffice\PhpWord\Element\Text') {
$text .= $e->getText();
} elseif (get_class($e) === 'PhpOffice\PhpWord\Section\TextBreak') {
$text .= " \n";
} else {
throw new Exception('Unknown class type ' . get_class($e));
}
}
}

print $text;

Вывод этого кода — только части текста:

The quick brown fox j

Есть ли проблема с кодом, или это какая-то проблема совместимости?

Редактировать:

Если я добавлю var_dump($els); до foreach ($els as $e) { вывод таков:

array(1) {
[0]=>
object(PhpOffice\PhpWord\Element\Text)#1265 (14) {
["text":protected]=>
string(21) "The quick brown fox j"["fontStyle":protected]=>
object(PhpOffice\PhpWord\Style\Font)#1267 (25) {
["aliases":protected]=>
array(1) {
["line-height"]=>
string(10) "lineHeight"}
["type":"PhpOffice\PhpWord\Style\Font":private]=>
string(4) "text"["name":"PhpOffice\PhpWord\Style\Font":private]=>
NULL
["hint":"PhpOffice\PhpWord\Style\Font":private]=>
NULL
["size":"PhpOffice\PhpWord\Style\Font":private]=>
NULL
["color":"PhpOffice\PhpWord\Style\Font":private]=>
NULL
["bold":"PhpOffice\PhpWord\Style\Font":private]=>
bool(false)
["italic":"PhpOffice\PhpWord\Style\Font":private]=>
bool(false)
["underline":"PhpOffice\PhpWord\Style\Font":private]=>
string(4) "none"["superScript":"PhpOffice\PhpWord\Style\Font":private]=>
bool(false)
["subScript":"PhpOffice\PhpWord\Style\Font":private]=>
bool(false)
["strikethrough":"PhpOffice\PhpWord\Style\Font":private]=>
bool(false)
["doubleStrikethrough":"PhpOffice\PhpWord\Style\Font":private]=>
bool(false)
["smallCaps":"PhpOffice\PhpWord\Style\Font":private]=>
bool(false)
["allCaps":"PhpOffice\PhpWord\Style\Font":private]=>
bool(false)
["fgColor":"PhpOffice\PhpWord\Style\Font":private]=>
NULL
["scale":"PhpOffice\PhpWord\Style\Font":private]=>
NULL
["spacing":"PhpOffice\PhpWord\Style\Font":private]=>
NULL
["kerning":"PhpOffice\PhpWord\Style\Font":private]=>
NULL
["paragraph":"PhpOffice\PhpWord\Style\Font":private]=>
object(PhpOffice\PhpWord\Style\Paragraph)#1266 (26) {
["aliases":protected]=>
array(1) {
["line-height"]=>
string(10) "lineHeight"}
["basedOn":"PhpOffice\PhpWord\Style\Paragraph":private]=>
string(6) "Normal"["next":"PhpOffice\PhpWord\Style\Paragraph":private]=>
NULL
["alignment":"PhpOffice\PhpWord\Style\Paragraph":private]=>
string(0) ""["indentation":"PhpOffice\PhpWord\Style\Paragraph":private]=>
NULL
["spacing":"PhpOffice\PhpWord\Style\Paragraph":private]=>
NULL
["lineHeight":"PhpOffice\PhpWord\Style\Paragraph":private]=>
NULL
["widowControl":"PhpOffice\PhpWord\Style\Paragraph":private]=>
bool(true)
["keepNext":"PhpOffice\PhpWord\Style\Paragraph":private]=>
bool(false)
["keepLines":"PhpOffice\PhpWord\Style\Paragraph":private]=>
bool(false)
["pageBreakBefore":"PhpOffice\PhpWord\Style\Paragraph":private]=>
bool(false)
["numStyle":"PhpOffice\PhpWord\Style\Paragraph":private]=>
NULL
["numLevel":"PhpOffice\PhpWord\Style\Paragraph":private]=>
int(0)
["tabs":"PhpOffice\PhpWord\Style\Paragraph":private]=>
array(0) {
}
["shading":"PhpOffice\PhpWord\Style\Paragraph":private]=>
NULL
["borderTopSize":protected]=>
NULL
["borderTopColor":protected]=>
NULL
["borderLeftSize":protected]=>
NULL
["borderLeftColor":protected]=>
NULL
["borderRightSize":protected]=>
NULL
["borderRightColor":protected]=>
NULL
["borderBottomSize":protected]=>
NULL
["borderBottomColor":protected]=>
NULL
["styleName":protected]=>
NULL
["index":protected]=>
NULL
["isAuto":"PhpOffice\PhpWord\Style\AbstractStyle":private]=>
bool(false)
}
["shading":"PhpOffice\PhpWord\Style\Font":private]=>
NULL
["rtl":"PhpOffice\PhpWord\Style\Font":private]=>
bool(false)
["styleName":protected]=>
NULL
["index":protected]=>
NULL
["isAuto":"PhpOffice\PhpWord\Style\AbstractStyle":private]=>
bool(false)
}
["paragraphStyle":protected]=>
object(PhpOffice\PhpWord\Style\Paragraph)#1266 (26) {
["aliases":protected]=>
array(1) {
["line-height"]=>
string(10) "lineHeight"}
["basedOn":"PhpOffice\PhpWord\Style\Paragraph":private]=>
string(6) "Normal"["next":"PhpOffice\PhpWord\Style\Paragraph":private]=>
NULL
["alignment":"PhpOffice\PhpWord\Style\Paragraph":private]=>
string(0) ""["indentation":"PhpOffice\PhpWord\Style\Paragraph":private]=>
NULL
["spacing":"PhpOffice\PhpWord\Style\Paragraph":private]=>
NULL
["lineHeight":"PhpOffice\PhpWord\Style\Paragraph":private]=>
NULL
["widowControl":"PhpOffice\PhpWord\Style\Paragraph":private]=>
bool(true)
["keepNext":"PhpOffice\PhpWord\Style\Paragraph":private]=>
bool(false)
["keepLines":"PhpOffice\PhpWord\Style\Paragraph":private]=>
bool(false)
["pageBreakBefore":"PhpOffice\PhpWord\Style\Paragraph":private]=>
bool(false)
["numStyle":"PhpOffice\PhpWord\Style\Paragraph":private]=>
NULL
["numLevel":"PhpOffice\PhpWord\Style\Paragraph":private]=>
int(0)
["tabs":"PhpOffice\PhpWord\Style\Paragraph":private]=>
array(0) {
}
["shading":"PhpOffice\PhpWord\Style\Paragraph":private]=>
NULL
["borderTopSize":protected]=>
NULL
["borderTopColor":protected]=>
NULL
["borderLeftSize":protected]=>
NULL
["borderLeftColor":protected]=>
NULL
["borderRightSize":protected]=>
NULL
["borderRightColor":protected]=>
NULL
["borderBottomSize":protected]=>
NULL
["borderBottomColor":protected]=>
NULL
["styleName":protected]=>
NULL
["index":protected]=>
NULL
["isAuto":"PhpOffice\PhpWord\Style\AbstractStyle":private]=>
bool(false)
}
["phpWord":protected]=>
object(PhpOffice\PhpWord\PhpWord)#1247 (3) {
["sections":"PhpOffice\PhpWord\PhpWord":private]=>
array(1) {
[0]=>
object(PhpOffice\PhpWord\Element\Section)#1261 (16) {
["container":protected]=>
string(7) "Section"["style":"PhpOffice\PhpWord\Element\Section":private]=>
object(PhpOffice\PhpWord\Style\Section)#1262 (28) {
["orientation":"PhpOffice\PhpWord\Style\Section":private]=>
string(8) "portrait"["paper":"PhpOffice\PhpWord\Style\Section":private]=>
object(PhpOffice\PhpWord\Style\Paper)#1263 (8) {
["sizes":"PhpOffice\PhpWord\Style\Paper":private]=>
array(6) {
["A3"]=>
array(3) {
[0]=>
int(297)
[1]=>
int(420)
[2]=>
string(2) "mm"}
["A4"]=>
array(3) {
[0]=>
int(210)
[1]=>
int(297)
[2]=>
string(2) "mm"}
["A5"]=>
array(3) {
[0]=>
int(148)
[1]=>
int(210)
[2]=>
string(2) "mm"}
["Folio"]=>
array(3) {
[0]=>
float(8.5)
[1]=>
int(13)
[2]=>
string(2) "in"}
["Legal"]=>
array(3) {
[0]=>
float(8.5)
[1]=>
int(14)
[2]=>
string(2) "in"}
["Letter"]=>
array(3) {
[0]=>
float(8.5)
[1]=>
int(11)
[2]=>
string(2) "in"}
}
["size":"PhpOffice\PhpWord\Style\Paper":private]=>
string(2) "A4"["width":"PhpOffice\PhpWord\Style\Paper":private]=>
int(11870)
["height":"PhpOffice\PhpWord\Style\Paper":private]=>
int(16787)
["styleName":protected]=>
NULL
["index":protected]=>
NULL
["aliases":protected]=>
array(0) {
}
["isAuto":"PhpOffice\PhpWord\Style\AbstractStyle":private]=>
bool(false)
}
["pageSizeW":"PhpOffice\PhpWord\Style\Section":private]=>
int(11906)
["pageSizeH":"PhpOffice\PhpWord\Style\Section":private]=>
int(16838)
["marginTop":"PhpOffice\PhpWord\Style\Section":private]=>
int(1417)
["marginLeft":"PhpOffice\PhpWord\Style\Section":private]=>
int(1417)
["marginRight":"PhpOffice\PhpWord\Style\Section":private]=>
int(1417)
["marginBottom":"PhpOffice\PhpWord\Style\Section":private]=>
int(1417)
["gutter":"PhpOffice\PhpWord\Style\Section":private]=>
int(0)
["headerHeight":"PhpOffice\PhpWord\Style\Section":private]=>
int(720)
["footerHeight":"PhpOffice\PhpWord\Style\Section":private]=>
int(720)
["pageNumberingStart":"PhpOffice\PhpWord\Style\Section":private]=>
NULL
["colsNum":"PhpOffice\PhpWord\Style\Section":private]=>
int(1)
["colsSpace":"PhpOffice\PhpWord\Style\Section":private]=>
int(720)
["breakType":"PhpOffice\PhpWord\Style\Section":private]=>
NULL
["lineNumbering":"PhpOffice\PhpWord\Style\Section":private]=>
NULL
["borderTopSize":protected]=>
NULL
["borderTopColor":protected]=>
NULL
["borderLeftSize":protected]=>
NULL
["borderLeftColor":protected]=>
NULL
["borderRightSize":protected]=>
NULL
["borderRightColor":protected]=>
NULL
["borderBottomSize":protected]=>
NULL
["borderBottomColor":protected]=>
NULL
["styleName":protected]=>
NULL
["index":protected]=>
NULL
["aliases":protected]=>
array(0) {
}
["isAuto":"PhpOffice\PhpWord\Style\AbstractStyle":private]=>
bool(false)
}
["headers":"PhpOffice\PhpWord\Element\Section":private]=>
array(0) {
}
["footers":"PhpOffice\PhpWord\Element\Section":private]=>
array(0) {
}
["elements":protected]=>
array(1) {
[0]=>
*RECURSION*
}
["phpWord":protected]=>
*RECURSION*
["sectionId":protected]=>
int(1)
["docPart":protected]=>
string(7) "Section"["docPartId":protected]=>
int(1)
["elementIndex":protected]=>
int(1)
["elementId":protected]=>
NULL
["relationId":protected]=>
NULL
["nestedLevel":"PhpOffice\PhpWord\Element\AbstractElement":private]=>
int(0)
["parentContainer":"PhpOffice\PhpWord\Element\AbstractElement":private]=>
NULL
["mediaRelation":protected]=>
bool(false)
["collectionRelation":protected]=>
bool(false)
}
}
["collections":"PhpOffice\PhpWord\PhpWord":private]=>
array(5) {
["Bookmarks"]=>
object(PhpOffice\PhpWord\Collection\Bookmarks)#1248 (1) {
["items":"PhpOffice\PhpWord\Collection\AbstractCollection":private]=>
array(0) {
}
}
["Titles"]=>
object(PhpOffice\PhpWord\Collection\Titles)#1249 (1) {
["items":"PhpOffice\PhpWord\Collection\AbstractCollection":private]=>
array(0) {
}
}
["Footnotes"]=>
object(PhpOffice\PhpWord\Collection\Footnotes)#1250 (1) {
["items":"PhpOffice\PhpWord\Collection\AbstractCollection":private]=>
array(0) {
}
}
["Endnotes"]=>
object(PhpOffice\PhpWord\Collection\Endnotes)#1251 (1) {
["items":"PhpOffice\PhpWord\Collection\AbstractCollection":private]=>
array(0) {
}
}
["Charts"]=>
object(PhpOffice\PhpWord\Collection\Charts)#1252 (1) {
["items":"PhpOffice\PhpWord\Collection\AbstractCollection":private]=>
array(0) {
}
}
}
["metadata":"PhpOffice\PhpWord\PhpWord":private]=>
array(3) {
["DocInfo"]=>
object(PhpOffice\PhpWord\Metadata\DocInfo)#1253 (12) {
["creator":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
string(0) ""["lastModifiedBy":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
string(0) ""["created":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
int(1483515248)
["modified":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
int(1483515248)
["title":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
string(0) ""["description":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
string(0) ""["subject":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
string(0) ""["keywords":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
string(0) ""["category":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
string(0) ""["company":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
string(0) ""["manager":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
string(0) ""["customProperties":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
array(0) {
}
}
["Protection"]=>
object(PhpOffice\PhpWord\Metadata\Protection)#1254 (1) {
["editing":"PhpOffice\PhpWord\Metadata\Protection":private]=>
NULL
}
["Compatibility"]=>
object(PhpOffice\PhpWord\Metadata\Compatibility)#1255 (1) {
["ooxmlVersion":"PhpOffice\PhpWord\Metadata\Compatibility":private]=>
int(12)
}
}
}
["sectionId":protected]=>
NULL
["docPart":protected]=>
string(7) "Section"["docPartId":protected]=>
int(1)
["elementIndex":protected]=>
int(1)
["elementId":protected]=>
string(6) "5d531b"["relationId":protected]=>
NULL
["nestedLevel":"PhpOffice\PhpWord\Element\AbstractElement":private]=>
int(0)
["parentContainer":"PhpOffice\PhpWord\Element\AbstractElement":private]=>
string(7) "Section"["mediaRelation":protected]=>
bool(false)
["collectionRelation":protected]=>
bool(false)
}
}

17

Решение

Попробуйте создать свой читатель, прежде чем

$source = "word.doc";
// create your reader object
$phpWordReader = \PhpOffice\PhpWord\IOFactory::createReader('MsDoc');
// read source
if($phpWordReader->canRead($source)) {
$phpWord = $phpWordReader->load($source);
... // rest of your code
}

Ответ основан на этом пример а также API документация

4

Другие решения

Вы можете извлечь текст из документа Word, используя catdoc http://www.wagner.pp.ru/~vitus/software/catdoc/

Он может быть установлен на Ubuntu с помощью

sudo apt-get install catdoc

Когда у вас есть catdoc, работающий в вашей системе, вы можете вызвать его из php с помощью shell_exec ()

<?php

$text = shell_exec('/(fullpath)/catdoc /(fullpath)/word.doc');

print $text;

?>

Обязательно замените (полный путь) фактическим путем к catdoc и вашим словом doc.

РЕДАКТИРОВАТЬ —- Дополнение

Если вы можете сохранить ваши файлы как .DOCX скорее, чем .доктор это немного проще. Ты можешь использовать расстегнуть молнию скорее, чем catdoc.

Просто замените:

$text = shell_exec('/(fullpath)/catdoc /(fullpath)/word.doc');

с

$text = shell_exec("/(fullpath)/unzip -p /(fullpath)/word.docx word/document.xml | sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g'");

Вы можете использовать эту же технику с большинством других документов командной строки для преобразования текста. Просто замените команду в shell_exec () на команду, которая работает в вашей системе. Ты можешь проверить Как извлечь просто текст из .doc & файлы .docx? (Unix) для других альтернатив Unix / Linux

Для других вариантов PHP проверить Как извлечь текст из файла слова .doc, docx, .xlsx, .pptx php

1

По вопросам рекламы [email protected]