Detecting response encoding
Kim Gräsman
2005-07-17 15:55:54 UTC
Hello,

I thought I'd use WinInet to download web page content.
I want to convert the response to Unicode (in the Win32 sense, that's UCS-2
or UTF-16, right?) so I can pass it around as a BSTR.

I need to inspect the Content-Type header to figure out the charset of the
body, don't I?
Assuming that's valid, I should be able to map iso-8859-x -> CP_ACP and UTF-8
-> CP_UTF8, and just use MultiByteToWideChar to convert the body to UCS-2.

But I suppose the charset may be just about anything described by ISO --
is there a clean translation into Windows codepages from ISO charset names?
Or does WinInet solve this transparently?
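For illustration, the first step (pulling the charset out of the Content-Type header value) might look roughly like the sketch below. It is portable C++ and does not call WinInet; `ExtractCharset` is a hypothetical helper name, and the parsing is deliberately simplified (it does not implement the full HTTP parameter grammar).

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper, not part of WinInet: returns the lower-cased value of
// the charset parameter in a Content-Type header value, or "" if absent.
std::string ExtractCharset(const std::string& contentType)
{
    // Work on a lower-cased copy so "Charset=" and "CHARSET=" also match.
    std::string lower(contentType);
    for (std::size_t i = 0; i < lower.size(); ++i)
        lower[i] = static_cast<char>(
            std::tolower(static_cast<unsigned char>(lower[i])));

    const std::string key = "charset=";
    std::string::size_type pos = lower.find(key);
    if (pos == std::string::npos)
        return "";
    pos += key.size();

    // The value may be quoted, as in charset="utf-8".
    if (pos < lower.size() && lower[pos] == '"') {
        std::string::size_type end = lower.find('"', pos + 1);
        if (end == std::string::npos)
            return lower.substr(pos + 1);   // tolerate a missing closing quote
        return lower.substr(pos + 1, end - pos - 1);
    }

    // Unquoted value: runs until the next ';' or whitespace, or end of string.
    std::string::size_type end = lower.find_first_of("; \t", pos);
    if (end == std::string::npos)
        return lower.substr(pos);
    return lower.substr(pos, end - pos);
}
```

Given the header value `text/html; charset=ISO-8859-1`, this returns `iso-8859-1`; a header without a charset parameter yields an empty string, in which case the caller has to fall back on some default.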

--
Best Regards,
Kim Gräsman
Stephen Sulzer
2005-07-20 08:24:57 UTC
(UCS-2 is the 16-bit Unicode 1.1 character set; UTF-16 is a 16-bit encoding
format. Refer to http://www.unicode.org/faq/basic_q.html#23 for the
difference between them. BSTR strings are UCS-2 and should be allocated
using SysAllocString/SysAllocStringLen.)

I don't think WinInet has to worry about this particular issue; it just
supplies the "raw" bytes of the HTTP response data. Handling the
charset-to-codepage conversion is something URLMON probably handles.

Included with Internet Explorer is mlang.dll that implements a COM object
called MultiLanguage.
http://msdn.microsoft.com/library/default.asp?url=/workshop/misc/mlang/reference/objects/cmultilanguage.asp

The MultiLanguage object implements an interface called IMultiLanguage2
which provides a method called GetCharsetInfo.
IMultiLanguage2::GetCharsetInfo converts a charset identifier to a code page
id which you can then use with MultiByteToWideChar to convert to a UCS-2
widechar string.
http://msdn.microsoft.com/library/default.asp?url=/workshop/misc/mlang/reference/ifaces/imultilanguage2/getcharsetinfo.asp
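To make the charset-to-code-page idea concrete, here is a minimal portable sketch of the kind of lookup involved. It is not MLang and does not call GetCharsetInfo; `CodePageFromCharset` and the table are hypothetical, cover only a handful of well-known charsets, and assume the name has already been lower-cased. The numeric values are the standard Windows code page identifiers.

```cpp
#include <cstddef>
#include <string>

// One row of a hand-written charset table (illustrative, far from complete).
struct CharsetEntry {
    const char*  name;      // lower-case charset name
    unsigned int codePage;  // Windows code page identifier
};

static const CharsetEntry kCharsets[] = {
    { "utf-8",        65001 },  // same value as CP_UTF8
    { "us-ascii",     20127 },
    { "iso-8859-1",   28591 },
    { "iso-8859-2",   28592 },
    { "iso-8859-15",  28605 },
    { "windows-1252", 1252  },
};

// Hypothetical helper: looks up a lower-cased charset name and stores the
// matching code page in codePage. Returns false for names not in the table.
bool CodePageFromCharset(const std::string& charset, unsigned int& codePage)
{
    const std::size_t count = sizeof(kCharsets) / sizeof(kCharsets[0]);
    for (std::size_t i = 0; i < count; ++i) {
        if (charset == kCharsets[i].name) {
            codePage = kCharsets[i].codePage;
            return true;
        }
    }
    return false;
}
```

On Windows, the resulting value would be passed as the CodePage argument of MultiByteToWideChar. Note that iso-8859-1 has its own identifier (28591); CP_ACP is the system's default ANSI code page, which varies from machine to machine.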

Another option to consider is the WinHttpRequest COM object. It will handle
this charset-to-code page translation automatically and will provide the
HTTP response data as a Unicode BSTR string.

Hope that helps.

- Stephen
Kim Gräsman
2005-07-20 11:03:25 UTC
Hello Stephen,
Post by Stephen Sulzer
(UCS-2 is the 16-bit Unicode 1.1 character set; UTF-16 is a 16-bit
encoding format. Refer to http://www.unicode.org/faq/basic_q.html#23
for the difference between them. BSTR strings are UCS-2 and should be
allocated using SysAllocString/SysAllocStringLen.)
Thanks for the clarification!
Post by Stephen Sulzer
Included with Internet Explorer is mlang.dll that implements a COM
object called MultiLanguage.
I found mlang.dll after posting. I haven't had time to try it out yet, but
it looks like exactly what I was looking for. The GetCharsetInfo method
looks like it will suit my purposes perfectly.
Post by Stephen Sulzer
Another option to consider is the WinHttpRequest COM object. It will
handle this charset-to-code page translation automatically and will
provide the HTTP response data as a Unicode BSTR string.
Cool, I didn't know about that. I never mentioned it, but I want something
that runs on Win95 through 2003, so unfortunately WinHttpRequest isn't
available on all my target platforms. From the looks of it, though, I'll be
building a similar but simpler wrapper.

Many thanks!

--
Best regards,
Kim Gräsman
