Truncated HttpWebResponse

by Jim Nov 08, 2006 11:07 PM

Pretend for a second that ReadToEnd() isn't a bad way to parse an HTML file from code. I'm doing a little screen-scraping here, and I thought I had a problem with my regular expression. It took me a while to realize that I didn't have the whole file.

The page is only 4K, and IE and Firefox download the page just fine, showing the full source. Recalling my experiences with downloading shortcut icons, it occurred to me that the Content-Length header could be reporting the wrong length. Sure enough, that was it: the header claimed fewer bytes than the server actually sent.

WebClient and HttpWebRequest (which probably use the same underlying code) both return a stream that is limited to the length in the Content-Length header, regardless of the actual length of the page. I don't think there's a way to get the whole page with these classes. If I want to screen-scrape this, I'll probably need to use lower-level code that can ignore the headers. Bummer.

// Get the html
try
{                        
    strm = Client.OpenRead(page);

    // won't read the whole file if length is short in the header
    Debug.WriteLine(string.Format("Content-Length: {0}",
        Client.ResponseHeaders["Content-Length"]));

    // Doesn't matter if we wrap the stream with reader or call ReadByte() 
    // on the raw stream.
    sr = new StreamReader(strm);
    html = sr.ReadToEnd();
    Debug.WriteLine(html);
    sr.Close();
    
    /*
    // This doesn't work either
    HttpWebRequest req = (HttpWebRequest)WebRequest.Create(page);
    HttpWebResponse resp = (HttpWebResponse)req.GetResponse();
    
    strm = resp.GetResponseStream();
    sr = new StreamReader(strm);
    html = sr.ReadToEnd();
    Debug.WriteLine(html.Length);
    sr.Close();
    resp.Close();
    */                        
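}
catch (WebException ex)
{
    // Assumed handler: the excerpt above showed only the try block
    Debug.WriteLine(ex.Message);
}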

But IE and Firefox work fine! They are programmed to recover from just about any screw-up that the server might make. While trying to download website icons, I found that they would recognize an icon even if the headers reported an incorrect content type, among other things.

If you point IE at a URL, and a stream comes back, IE will figure out what's going on.
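
For what it's worth, the lower-level route mentioned earlier might look something like the sketch below. It is not tested against the page in question. It assumes a plain HTTP (not HTTPS) server that closes the connection once it has sent everything, which is why the request is made as HTTP/1.0 with Connection: close. The names RawHttp, FetchRaw and ExtractBody are made up for illustration. The idea is simply to read from the socket until the server hangs up, so the Content-Length header is never consulted.

```csharp
using System;
using System.IO;
using System.Net.Sockets;
using System.Text;

public static class RawHttp
{
    // Sends a bare HTTP/1.0 GET and reads until the server closes the
    // connection. Returns the raw response: status line, headers and body.
    public static byte[] FetchRaw(string host, string path, int port = 80)
    {
        using (TcpClient client = new TcpClient(host, port))
        using (NetworkStream net = client.GetStream())
        {
            string request = "GET " + path + " HTTP/1.0\r\n" +
                             "Host: " + host + "\r\n" +
                             "Connection: close\r\n\r\n";
            byte[] requestBytes = Encoding.ASCII.GetBytes(request);
            net.Write(requestBytes, 0, requestBytes.Length);

            using (MemoryStream buffer = new MemoryStream())
            {
                // Copies until Read() returns 0, i.e. the server hung up.
                net.CopyTo(buffer);
                return buffer.ToArray();
            }
        }
    }

    // Returns everything after the first blank line (CRLF CRLF), whatever
    // the headers claim about its length. Empty string if no blank line.
    public static string ExtractBody(byte[] raw, Encoding encoding)
    {
        for (int i = 0; i + 3 < raw.Length; i++)
        {
            if (raw[i] == 13 && raw[i + 1] == 10 &&
                raw[i + 2] == 13 && raw[i + 3] == 10)
            {
                int start = i + 4;
                return encoding.GetString(raw, start, raw.Length - start);
            }
        }
        return string.Empty;
    }
}
```

Usage would be along the lines of `string html = RawHttp.ExtractBody(RawHttp.FetchRaw("example.com", "/"), Encoding.UTF8);`, where the host and the choice of UTF-8 are placeholders. A server that ignores Connection: close and holds the socket open would make FetchRaw block, so a real version would want a read timeout.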
