HTML and regex
2013-10-09 16:29:47.419539+00 by
Dan Lyke
5 comments
Sigh: Saw the "don't parse HTML/XML with regex" question again: http://stackoverflow.com/quest...except-xhtml-self-contained-tags
The first answer was cute, but I was reminded that anyone who thinks you can parse HTML or XML without doing some serious regex fix-up first has never actually used any XML or HTML in the wild, where the parser's answer is almost always "this isn't valid XML and I can't deal with it, sorry!" Those people can just ST-right-the-FU.
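To make "regex fix-up first" concrete, here is a minimal Python sketch. It is not the Flutterby code; the names `BARE_AMP`, `fix_up`, and `parse_lenient` are made up for illustration, and it handles only one common kind of breakage (bare ampersands). Real-world input usually needs several such passes.

```python
import re
import xml.etree.ElementTree as ET

# Matches an ampersand that does NOT start a named or numeric entity reference.
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)")

def fix_up(text):
    """Escape bare ampersands so a strict XML parser will accept the text."""
    return BARE_AMP.sub("&amp;", text)

def parse_lenient(text):
    """Try a strict parse first; fall back to the regex fix-up on failure."""
    try:
        return ET.fromstring(text)
    except ET.ParseError:
        return ET.fromstring(fix_up(text))

doc = parse_lenient("<p>Fish & chips &amp; peas</p>")
print(doc.text)  # Fish & chips & peas
```

The strict parser rejects the sample outright because of the bare `&`; after the fix-up it parses. Named entities that XML does not define (such as `&nbsp;`) would still fail and would need their own substitution pass.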
[ related topics:
Web development Content Management Woodworking
]
comments in ascending chronological order:
#Comment Re: HTML Parsing made: 2013-10-09 16:59:16.241933+00 by:
Jack William Bell
I once had a six-week project to write a custom HTML parser/DOM in C#, meant to run on ASP. I got it
working with support for all the elements and most of the ways people could screw up HTML.
It was reasonably fast, but it used a LOT of memory because I wrote it as a recursive descent parser. I coded
the basic parser and DOM in a week. The other five weeks were fixing it with special cases until it
supported all the unit tests. Tail recursion to the rescue!
#Comment Re: made: 2013-10-09 18:02:16.293182+00 by:
Dan Lyke
I used to worry about memory consumption, but now I run "top", hit "M", and see that Firefox is using 669m with 3 tabs open.
There are platforms where memory matters, but anything where C# is a reasonable development option probably isn't one of 'em.
#Comment Re: made: 2013-10-09 20:04:48.885032+00 by:
Jack William Bell
Sure, but the parser was supposed to run on a server. One of the other devs complained about the memory
usage and claimed he could code up something with Regex that would do the same thing but use a lot less
memory.
The lead dev (who was a young guy) was smart enough to say "No you can't."
#Comment Re: made: 2013-10-09 21:56:34.979492+00 by:
Dan Lyke
The Flutterby system parser uses regexes, but in a way that's suspiciously parser-like. On the other hand I've tried to rewrite it in Flex and got caught in a maze of twisty little state machines, all weird.
But even on a server, 6 engineer weeks will buy a shload of RAM.
#Comment Re: made: 2013-10-09 22:25:18.687246+00 by:
meuon
I've been flamed by some engineers at metering companies for the way I parse their big fat XML files. Then I delete a few million records and re-suck in their XML for it. My limit is how fast I can insert into the SQL server.. ;)
The trick I found is to only load as XML a chunk at a time.
Technically not correct, because I am not looking at the entire XML file at the same time, but it sure works well.
Example: 50k records, all chunked by meter#. Cut the XML file into chunks running from <meter to /meter>, load each chunk as XML (usually with simplexml in PHP), and parse it. Repeat until done.
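A rough sketch of that chunk-at-a-time loop, written in Python with `xml.etree.ElementTree` rather than the PHP/simplexml mentioned above. The function name `iter_meters`, the `block_size` parameter, and the sample element and attribute names are all invented for illustration. It assumes `<meter>` elements are never nested or self-closing.

```python
import io
import re
import xml.etree.ElementTree as ET

# One record: from "<meter" through the next "</meter>".
# Assumes <meter> elements are never nested and never self-closing.
METER = re.compile(r"<meter\b.*?</meter>", re.DOTALL)

def iter_meters(stream, block_size=64 * 1024):
    """Yield one parsed <meter> element at a time from a text stream."""
    buffer = ""
    while True:
        block = stream.read(block_size)
        buffer += block
        end = 0
        for match in METER.finditer(buffer):
            # Only this one small chunk is ever handed to the XML parser.
            yield ET.fromstring(match.group(0))
            end = match.end()
        # Keep only the unconsumed tail, which may hold a partial record.
        buffer = buffer[end:]
        if not block:  # an empty read means end of input
            break

sample = io.StringIO(
    "<readings>"
    "<meter id='1'><kwh>10</kwh></meter>"
    "<meter id='2'><kwh>20</kwh></meter>"
    "</readings>"
)
for meter in iter_meters(sample, block_size=16):
    print(meter.get("id"), meter.findtext("kwh"))
```

Memory use is bounded by the largest single record plus one read block, not by the file size. The "technically not correct" part shows up if the chunks rely on anything declared outside them, such as namespace prefixes on the root element: a chunk parsed on its own will not see those declarations.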
I've also learned to not try to do it when they upload a file via an API. Just write that input to disk and parse it with a completely different process, usually triggered or just cron'd. This is useful because I can fix code for errant data after the fact. Accept the garbage in, sort through it for what we need later (1 to 5 minutes is real time in this world).
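The accept-now, parse-later split might look something like this in Python. This is a sketch under assumptions, not the actual setup described above: the function names, the spool directory layout, and the `.part`/`.done` suffixes are all hypothetical.

```python
import os
import tempfile
import uuid
from pathlib import Path

def accept_upload(data, spool_dir):
    """API side: write the raw upload to disk untouched and return its path."""
    spool_dir = Path(spool_dir)
    spool_dir.mkdir(parents=True, exist_ok=True)
    final = spool_dir / f"{uuid.uuid4().hex}.xml"
    partial = final.with_suffix(".part")
    partial.write_bytes(data)
    # Rename last so the worker never picks up a half-written file.
    os.replace(partial, final)
    return final

def process_spool(spool_dir, handle):
    """Worker side (cron or trigger): hand each spooled file to `handle`."""
    done = []
    for path in sorted(Path(spool_dir).glob("*.xml")):
        try:
            handle(path.read_bytes())
        except Exception:
            # Leave the file in place so it can be retried after a code fix.
            continue
        path.rename(path.with_suffix(".done"))
        done.append(path.name)
    return done

with tempfile.TemporaryDirectory() as spool:
    accept_upload(b"<meter id='1'/>", spool)
    print(process_spool(spool, lambda raw: print(len(raw), "bytes")))
```

Because a file that fails to parse stays in the spool directory, fixing the parser and letting the next run pick it up is all the recovery that is needed.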