Parse some HTML tags with a Java SAX XML parser

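Parsing XML with Java is straightforward because the parsers ship with the JDK. If you want to parse a few HTML tags, wrap them in a root element so that they form a valid XML document, and then hand the result to the standard XML parser. This only works when the HTML is itself well-formed XML: every tag must be closed and every entity must be declared. Note that the example below uses the DOM API (DocumentBuilder), which builds the whole document tree in memory; only the InputSource class comes from the SAX package.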

Code

import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class NewMain {

  public static void main(String[] args) throws Exception {
    String html = "<h1>Headline</h1><p><b>Hello World.</b><b>This is a test.</b></p>";
    // XML allows exactly one root element, so wrap the fragment in one.
    html = "<root>" + html + "</root>";

    // Parse the string into an in-memory DOM tree.
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    DocumentBuilder builder = factory.newDocumentBuilder();
    Document document = builder.parse(new InputSource(new StringReader(html)));

    // Print the text content of every <b> element, in document order.
    NodeList elementsByTagName = document.getElementsByTagName("b");
    for (int i = 0; i < elementsByTagName.getLength(); i++) {
      Node element = elementsByTagName.item(i);
      String text = element.getTextContent();
      System.out.println(text);
    }
  }
}

Result

Hello World.
This is a test.
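
The code above goes through the DOM API. For comparison, here is a minimal sketch of the same extraction with an event-based SAX handler, which does not build a tree in memory. The class name `SaxBoldText` and the helper `extractBoldText` are made up for this illustration, and the sketch assumes that `<b>` elements are not nested inside each other.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class SaxBoldText {

  // Hypothetical helper: returns the text of every <b> element in a
  // well-formed HTML fragment. Nested <b> elements are not supported.
  public static List<String> extractBoldText(String htmlFragment) {
    String xml = "<root>" + htmlFragment + "</root>";
    List<String> texts = new ArrayList<>();

    DefaultHandler handler = new DefaultHandler() {
      private StringBuilder current; // non-null only while inside a <b> element

      @Override
      public void startElement(String uri, String localName, String qName, Attributes attributes) {
        if ("b".equals(qName)) {
          current = new StringBuilder();
        }
      }

      @Override
      public void characters(char[] ch, int start, int length) {
        // The parser may deliver one text node in several chunks, so append.
        if (current != null) {
          current.append(ch, start, length);
        }
      }

      @Override
      public void endElement(String uri, String localName, String qName) {
        if ("b".equals(qName) && current != null) {
          texts.add(current.toString());
          current = null;
        }
      }
    };

    try {
      SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
      parser.parse(new InputSource(new StringReader(xml)), handler);
    } catch (ParserConfigurationException | SAXException | IOException e) {
      throw new IllegalArgumentException("Fragment is not well-formed XML", e);
    }
    return texts;
  }

  public static void main(String[] args) {
    String html = "<h1>Headline</h1><p><b>Hello World.</b><b>This is a test.</b></p>";
    for (String text : extractBoldText(html)) {
      System.out.println(text);
    }
  }
}
```

Run with the same input string, this prints the same two lines as the DOM version. The handler approach is mostly worth the extra code for large inputs, where holding the full tree in memory would be wasteful.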

One thought on “Parse some HTML tags with a Java SAX XML parser”

  1. If you want to process HTML with Java, I recommend the jsoup project instead. SAX is almost too strict to serve as a good HTML parser.
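
The strictness mentioned in the comment is easy to reproduce with the JDK parser alone. The following is a small sketch, not part of the original post: the class `StrictnessCheck` and the method `parsesAsXml` are hypothetical names, and the sample fragments are chosen only to show typical HTML that is not well-formed XML.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class StrictnessCheck {

  // Hypothetical helper: true if the fragment, wrapped in a root element,
  // is accepted by the XML parser; false if the parser rejects it.
  public static boolean parsesAsXml(String htmlFragment) {
    try {
      DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
      // DefaultHandler rethrows fatal errors without printing them to stderr.
      builder.setErrorHandler(new DefaultHandler());
      builder.parse(new InputSource(new StringReader("<root>" + htmlFragment + "</root>")));
      return true;
    } catch (SAXException e) {
      return false; // the input is not well-formed XML
    } catch (ParserConfigurationException | IOException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(parsesAsXml("<p>Hello<br/>World</p>"));  // self-closed tag: accepted
    System.out.println(parsesAsXml("<p>Hello<br>World</p>"));   // unclosed <br>: rejected
    System.out.println(parsesAsXml("<p>Hello&nbsp;World</p>")); // undeclared entity: rejected
    System.out.println(parsesAsXml("<p>Tom & Jerry</p>"));      // bare ampersand: rejected
  }
}
```

Browsers accept all four fragments, which is why a lenient HTML parser such as jsoup is usually the better fit for markup found in the wild.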
