Simple HTML DOM: elements between other elements

Question

I am trying to write a PHP script that crawls a website and saves some of its elements to a database.

Here is my problem. The web page is written like this:

<h2>The title 1</h2>
<p class="one_class"> Some text </p>
<p> Some interesting text </p>

<h2>The title 2</h2>
<p class="one_class"> Some text </p>
<p> Some interesting text </p>

<p class="one_class"> Some different text </p>
<p> Some other interesting text </p>

<h2>The title 3</h2>
<p class="one_class"> Some text </p>
<p> Some interesting text </p>

I want to get only the h2 elements and the p elements with the interesting text, not the p elements with class="one_class".

I tried this PHP code:

<?php
$numberP = 0;
foreach ($html->find('p') as $p)
{
    $pIsOneClass = PIsOneClass($html, $p);

    if ($pIsOneClass == false)
    {
        echo $p->outertext;
        $h2 = $html->find("h2", $numberP);
        echo $h2->outertext;
        $numberP++;
    }
}
?>

The PIsOneClass($html, $p) function looks like this:

<?php
function PIsOneClass($html, $p)
{
    foreach ($html->find("p.one_class") as $p_one_class)
    {
        if ($p == $p_one_class)
        {
            return true;
        }
    }
    return false;
}
?>

This doesn't work. I understand why, but I don't know how to fix it.

How can we say: "I want every p without a class that sits between two h2 elements"?

Thank you very much!

html php simple-html-dom

Other Answers

From the Simple HTML DOM manual:

[attribute=value]

Matches elements that have the specified attribute with a certain value.

or

[!attribute]

Matches elements that don't have the specified attribute.
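
Applied to the page in the question, the [!attribute] selector can be combined with h2 in a single find() call. This is only a minimal sketch: it assumes the Simple HTML DOM library is available as simple_html_dom.php next to the script (that path is an assumption), and that the interesting paragraphs carry no class attribute at all.

```php
<?php
// Assumption: the Simple HTML DOM library file sits next to this script.
require_once __DIR__ . '/simple_html_dom.php';

$source = '<h2>The title 1</h2>
<p class="one_class"> Some text </p>
<p> Some interesting text </p>
<h2>The title 2</h2>
<p class="one_class"> Some text </p>
<p> Some interesting text </p>';

$html = str_get_html($source);

// "h2, p[!class]" selects every h2 plus every p that has no class attribute.
$lines = [];
foreach ($html->find('h2, p[!class]') as $element) {
    $lines[] = $element->tag . ': ' . trim($element->plaintext);
}

echo implode(PHP_EOL, $lines) . PHP_EOL;
```

Note that p[!class] filters on the absence of any class attribute, so a paragraph with some other class would be skipped too.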


Accepted Answer

This task is easier with XPath, since you are selecting more than one kind of element and want to keep the results in source order. You can use PHP's DOM library, which includes DOMXPath, to find and filter the elements you need:

$html = '<h2>The title 1</h2>
<p class="one_class"> Some text </p>
<p> Some interesting text </p>

<h2>The title 2</h2>
<p class="one_class"> Some text </p>
<p> Some interesting text </p>

<p class="one_class"> Some different text </p>
<p> Some other interesting text </p>

<h2>The title 3</h2>
<p class="one_class"> Some text </p>
<p> Some interesting text </p>';

# create a new DOM document and load the html
$dom = new DOMDocument;
$dom->loadHTML($html);
# create a new DOMXPath object
$xp = new DOMXPath($dom);

# search for all h2 elements and all p elements that do not have the class 'one_class'
$interest = $xp->query('//h2 | //p[not(@class="one_class")]');

# iterate through the search results (a DOMNodeList of h2 and p elements), printing out
# node names and values
foreach ($interest as $i) {
    echo "node " . $i->nodeName . ", value: " . $i->nodeValue . PHP_EOL;
}

Output:

node h2, value: The title 1
node p, value:  Some interesting text
node h2, value: The title 2
node p, value:  Some interesting text
node p, value:  Some other interesting text
node h2, value: The title 3
node p, value:  Some interesting text

As you can see, the source order is preserved, and the unwanted nodes are easy to exclude.
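
Since the goal is to save the results to a database, it may help to group each paragraph under the h2 that precedes it. The following is a sketch using the same XPath query; the $sections array and its shape are illustrative and not part of the original answer, and it relies on the results coming back in document order, as the output above shows.

```php
<?php
$html = '<h2>The title 1</h2>
<p class="one_class"> Some text </p>
<p> Some interesting text </p>
<h2>The title 2</h2>
<p class="one_class"> Some text </p>
<p> Some interesting text </p>
<p class="one_class"> Some different text </p>
<p> Some other interesting text </p>';

$dom = new DOMDocument;
$dom->loadHTML($html);
$xp = new DOMXPath($dom);

// Hypothetical structure: heading text => list of paragraph texts.
$sections = [];
$current = null;

foreach ($xp->query('//h2 | //p[not(@class="one_class")]') as $node) {
    $text = trim($node->nodeValue);
    if ($node->nodeName === 'h2') {
        // A new heading starts a new group.
        $current = $text;
        $sections[$current] = [];
    } elseif ($current !== null) {
        // A paragraph belongs to the most recent heading.
        $sections[$current][] = $text;
    }
}

print_r($sections);
```

With this sample, "The title 2" ends up with two paragraphs, which answers the "every p between two h2" part of the question. Keying by heading text assumes the headings are unique; if they may repeat, a list of heading/paragraph pairs would be safer.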
