Cristian Amarie
2006-02-05 17:15:11 UTC
Input: URL
Desired output: a true/false answer as to whether the URL represents a text
file. The decision (for me, at least) is made if the MIME type is successfully
detected and is "text/plain".
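To make the "is it text/plain" decision concrete, here is a minimal sketch of the comparison I have in mind once a MIME string is available (for example the wsMime value returned by FindMimeFromData). IsTextPlainMime is a hypothetical helper name, not part of urlmon, and the handling of a "; charset=..." suffix is my assumption about what such a string might contain.

```cpp
#include <cwctype>
#include <string>

// Hypothetical helper (not part of urlmon): returns true when a MIME string
// such as L"text/plain" or L"Text/Plain; charset=us-ascii" names text/plain.
bool IsTextPlainMime(const std::wstring& mime)
{
    // Keep only the type/subtype part, dropping any ";parameter" suffix.
    std::wstring type = mime.substr(0, mime.find(L';'));

    // Trim surrounding spaces and tabs.
    const wchar_t* ws = L" \t";
    const std::wstring::size_type first = type.find_first_not_of(ws);
    if (first == std::wstring::npos)
        return false;
    const std::wstring::size_type last = type.find_last_not_of(ws);
    type = type.substr(first, last - first + 1);

    // MIME type names are compared case-insensitively.
    for (wchar_t& ch : type)
        ch = static_cast<wchar_t>(std::towlower(ch));

    return type == L"text/plain";
}
```

The helper only normalizes and compares the string; it does not say anything about how the MIME type was obtained.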
(The main problem is actually a little bit larger: I'm inside a BHO's
ShowContextMenu for an anchor, and the question is whether the URL of the
anchor element will end up displaying a text file or not.)
For direct URLs such as http://www.rfc-editor.org/rfc/rfc1957.txt, passing
the URL to FindMimeFromData seems quite sufficient.
The trouble starts when the URL is redirected. For example, navigate to
http://www.rfc-editor.org/cgi-bin/rfcsearch.pl, enter an RFC number to search
for (such as 1957), choose RFC File ASCII+, and set the "RFC Content Via"
radio button to HTTP; the result will be the URL
http://www.rfc-editor.org/cgi-bin/rfcdoctype.pl?loc=RFC&letsgo=1957&type=http&file_format=txt
When the link is clicked, IE redirects to the first URL.
I have tried three ways to determine whether the final redirection target of
the latter URL is a text file:
1. InternetOpen + InternetOpenUrl + InternetReadFile to read 8192 bytes.
The result was saved in a local file called C:\url (without any extension)
and the filename was passed to FindMimeFromData.
The result is E_FAIL.
(Naming the file C:\url.txt returns S_OK, but this is obviously arbitrary,
since FindMimeFromData also looks at the file extension.)
Call:
E_FAIL << FindMimeFromData(0, L"C:\\url", 0, 0, 0, 0, &wsMime, 0);
It looks to me as if FindMimeFromData does not check the file content in
this case.
2. The same as 1, but instead of passing the local filename, the downloaded
buffer is passed to FindMimeFromData.
Call:
S_OK * << FindMimeFromData(0, 0, (LPVOID)pBuffer, cbRead, 0, 0, &wsMime,
0);
The return is S_OK and wsMime is "text/plain".
Seems ok, but see below *...
3. Call FindMimeFromData, passing the URL directly.
Call:
E_FAIL << FindMimeFromData(0, pwUrl, 0, 0, 0, 0, &wsMime, 0);
* Now, if the RFC search is done over HTTP, the result seems OK.
But if the "RFC Content Via" radio button on the search page is set to FTP,
the data downloaded by InternetReadFile is
<html><head><title>Click to get to RFC 1957></head>
<body bgcolor="#FFFFFF"><center>
<h1>Your browser does not support redirection.</h1>
<p>You must click ...
etc.
which to me looks more like a "text/html" MIME type, and not "text/plain".
I'm looking to determine whether the content is plain text (and not HTML,
even though the downloaded HTML is also text, after all).
I have been reading and rereading the MIME details on MSDN, but I still can't
figure out why the HTML buffer above is detected as text/plain and not
text/html.
I would like to perform a more accurate text/plain detection (based on
partially downloaded WinInet content if absolutely necessary), but not before
ruling out the possibility that I am using FindMimeFromData incorrectly.
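For reference, the kind of fallback check I have in mind on the downloaded buffer is roughly the sketch below. It is my own rough heuristic, not what FindMimeFromData does internally; SniffBuffer, SniffResult and the list of HTML markers are names and choices I made up for illustration. A plain text file that happens to quote one of these tags would be misclassified as HTML.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical classification of a partially downloaded buffer.
enum class SniffResult { Unknown, Html, PlainText };

// Rough heuristic (an assumption, not the urlmon algorithm):
//  - empty buffers and buffers with binary control bytes -> Unknown
//  - buffers containing a common HTML marker             -> Html
//  - everything else                                     -> PlainText
SniffResult SniffBuffer(const char* data, std::size_t size)
{
    if (data == nullptr || size == 0)
        return SniffResult::Unknown;

    // Control bytes other than TAB, LF, CR, FF and ESC suggest binary data.
    for (std::size_t i = 0; i < size; ++i) {
        const unsigned char c = static_cast<unsigned char>(data[i]);
        if (c < 0x20 && c != '\t' && c != '\n' && c != '\r' &&
            c != '\f' && c != 0x1B)
            return SniffResult::Unknown;
    }

    // Lower-case copy so that the marker search is case-insensitive.
    std::string lower(data, size);
    for (char& ch : lower)
        ch = static_cast<char>(std::tolower(static_cast<unsigned char>(ch)));

    // Markers chosen for illustration only.
    static const char* const kHtmlMarkers[] = {
        "<!doctype html", "<html", "<head", "<title", "<body"
    };
    for (const char* marker : kHtmlMarkers) {
        if (lower.find(marker) != std::string::npos)
            return SniffResult::Html;
    }
    return SniffResult::PlainText;
}
```

With the buffer read by InternetReadFile this would be called as SniffBuffer(pBuffer, cbRead), treating pBuffer as a char pointer; the "Your browser does not support redirection" page above would come back as Html.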
Looking forward to any kind of suggestion.
Many thanks,
Cristian Amarie