In this tutorial, I give an example of a sed command that removes HTML tags from a file on Linux/Unix systems. In other words, it converts an HTML file to a plain text file.
Sometimes, when we download text from a website, we also get its HTML tags, which makes the data harder to read.
A standard HTML page contains many types of HTML tags. Below is a sample of an HTML file:
htmlpage.html
<html>
<head>
<title>Web Page Title</title>
</head>
<body>
<p>
This line contains a bold element <b>Fox Infotech</b>.
And this line contains the italic text <i>Vinish Kapoor's Blog</i>
This file would be converted into the plain text by using the sed command.
</body>
</html>
HTML tags are delimited by the less-than (<) and greater-than (>) symbols. Most HTML tags come in pairs: an opening tag starts an element (for example, <p> starts a paragraph), and a closing tag ends it (for example, </p> ends the paragraph).
The following sed command removes the HTML tags from a file. Its first expression, s/<[^>]*>//g, deletes every run of text that starts with < and ends at the next >. Its second expression, /^$/d, deletes the lines that are left empty.
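To see how the angle brackets mark out each tag, the short sketch below lists the tags found in one line of markup. The sample string is made up for illustration, and it relies on grep's -o option, which GNU and BSD grep provide but POSIX does not require.

```shell
# A made-up line of markup, used only for illustration.
sample='<p>Hello <b>world</b></p>'

# "<[^>]*>" means: a "<", any run of characters other than ">", then a ">".
# grep -o prints each match on its own line (-o is a GNU/BSD extension, not POSIX).
tags=$(printf '%s\n' "$sample" | grep -o '<[^>]*>')

printf '%s\n' "$tags"
# Prints:
# <p>
# <b>
# </b>
# </p>
```

Both the opening and the closing tags match, because the pattern only cares about the surrounding < and > and not about what sits between them.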
Remove HTML Tags from a File in Linux
sed 's/<[^>]*>//g ; /^$/d' htmlpage.html
Output
Web Page Title
This line contains a bold element Fox Infotech.
And this line contains the italic text Vinish Kapoor's Blog
This file would be converted into the plain text by using the sed command.
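To see what each half of that command contributes, the sketch below runs the tag removal on its own and then together with the blank-line deletion. It works on a shortened copy of the sample page written to a temporary directory. The line breaks in that copy, the workdir path, and the variable names are assumptions made for this example.

```shell
# Work in a throwaway directory so nothing in the current directory is touched.
workdir=$(mktemp -d)

# Shortened copy of the sample page; the line breaks are assumed.
cat > "$workdir/htmlpage.html" <<'EOF'
<html>
<head>
<title>Web Page Title</title>
</head>
<body>
<p>
This line contains a bold element <b>Fox Infotech</b>.
</body>
</html>
EOF

# Tag removal only: lines that held nothing but a tag are now empty.
# grep -c '^$' counts those empty lines.
blank_lines=$(sed 's/<[^>]*>//g' "$workdir/htmlpage.html" | grep -c '^$')
echo "empty lines left by the substitution alone: $blank_lines"

# Tag removal followed by deletion of empty lines.
# Two -e expressions behave like the two commands joined with ";".
both=$(sed -e 's/<[^>]*>//g' -e '/^$/d' "$workdir/htmlpage.html")
printf '%s\n' "$both"

rm -r "$workdir"
```

With this input, the substitution alone leaves seven empty lines, and the combined command prints only the two lines that carry text. Because sed processes one line at a time, a tag that is split across two lines would not match this pattern.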
Convert HTML to Text in Linux
The following sed command removes the HTML tags and redirects the output to a text file.
sed 's/<[^>]*>//g ; /^$/d' htmlpage.html > output.txt
Check the contents of output.txt
$ cat output.txt
Web Page Title
This line contains a bold element Fox Infotech.
And this line contains the italic text Vinish Kapoor's Blog
This file would be converted into the plain text by using the sed command.
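One case worth checking in the output file is indented HTML. A line that held only an indented tag still contains spaces after the tag is removed, so /^$/d does not delete it. The sketch below compares the command used above with a variant that also deletes whitespace-only lines. The indented page, the file names, and the variant itself are assumptions made for this example and are not part of the original command.

```shell
# Work in a throwaway directory so nothing in the current directory is touched.
workdir=$(mktemp -d)

# Hypothetical indented page, made up for this sketch.
cat > "$workdir/indented.html" <<'EOF'
<html>
  <body>
    <p>Indented <b>text</b></p>
  </body>
</html>
EOF

# Command from this tutorial: whitespace-only lines survive /^$/d.
sed 's/<[^>]*>//g ; /^$/d' "$workdir/indented.html" > "$workdir/output.txt"
lines_original=$(grep -c '' "$workdir/output.txt")

# Variant: [[:space:]]* lets the delete step match lines holding only spaces or tabs.
sed 's/<[^>]*>//g ; /^[[:space:]]*$/d' "$workdir/indented.html" > "$workdir/output_trimmed.txt"
lines_variant=$(grep -c '' "$workdir/output_trimmed.txt")

echo "lines kept by the original command: $lines_original"
echo "lines kept by the variant: $lines_variant"

rm -r "$workdir"
```

For this input the original command keeps three lines, two of which contain only spaces, while the variant keeps just the line with text. The variant does not strip the leading indentation from that remaining line.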