milesmcbain,
@milesmcbain@fosstodon.org avatar

Is anyone else who’s taken 4.3 now spending inordinate amounts of time dealing with invalid character encodings? 😩😩😩

hrbrmstr,
@hrbrmstr@mastodon.social avatar

@milesmcbain None b/c I use {stringi} for like, everything.

Danwwilson,

@hrbrmstr @milesmcbain tell us more please. This has chewed hrs and hrs of my time over the last few weeks. The biggest struggle is mixed encodings in the same column in the same file. Any help is welcome.

hrbrmstr,
@hrbrmstr@mastodon.social avatar

@Danwwilson @milesmcbain A light example.

I'm processing some podcast transcripts that were AI transcribed. Whisper will often stick some odd encodings in the files (esp the JSON ones for some reason).

library(stringi)

PATH_TO_FILE_TO_CONVERT |>
  stri_read_raw() |>           # read the file as raw bytes, nothing decoded yet
  stri_conv(to = "UTF-8") |>   # convert from the current default encoding to UTF-8 in one go
  jsonlite::fromJSON()

gets me out of a world of trouble.

Now, if yours are truly “mixed” within the same column you may need to go line-by-line. (next post)

hrbrmstr,
@hrbrmstr@mastodon.social avatar

@Danwwilson @milesmcbain

(see attached or https://carbon.now.sh/xJFxG8bDTh69XG4wo8Ze)

that last chunk makes it easier to just work with the raw vector (save it back out, use {readr} functions directly, etc)
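(A rough sketch of the line-by-line idea described above, assuming a hypothetical messy.txt and leaning on {stringi}'s encoding guesser; the actual attached example may differ, and the guesser can get individual lines wrong, as discussed further down.)

library(stringi)

# read the whole file as raw bytes -- nothing is decoded yet
raw_bytes <- stri_read_raw("messy.txt")

# split the raw vector into per-line chunks on the newline byte (0x0a)
nl <- which(raw_bytes == as.raw(0x0a))
raw_lines <- split(raw_bytes, findInterval(seq_along(raw_bytes), nl + 1L))

# guess an encoding for each line's bytes, then convert that line to UTF-8
utf8_lines <- vapply(raw_lines, function(bytes) {
  guess <- stri_enc_detect(bytes)[[1]]$Encoding[1]   # best guess for this line
  stri_conv(bytes, from = guess, to = "UTF-8")
}, character(1), USE.NAMES = FALSE)

# each converted line still carries its own newline, so write them back as-is
writeLines(utf8_lines, "messy-utf8.txt", sep = "")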

hrbrmstr,
@hrbrmstr@mastodon.social avatar

@Danwwilson @milesmcbain by staying at the "raw" level, you bypass a ton of potential woes b/c nothing at the R level sees the "strings" until that final encoding.

Danwwilson,

@hrbrmstr @milesmcbain Thanks a bunch. I will have a go with this at a couple of files that have given me particular grief in recent times and let you know if {stringi} works as well for me as it does for you 😀

danielmoul,

@Danwwilson @hrbrmstr @milesmcbain similar to Bob I have often used this:

mutate_if(is.character, ~ purrr::map_chr(.x, iconv, "UTF-8", "UTF-8", sub="")) %>% # just in case

There are a lot more invisible bad character encodings out there than I expected.
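(A self-contained sketch of that fragment in context; the data frame, its note column, and the deliberately broken byte are all made up for illustration.)

library(dplyr)
library(purrr)

# toy data with one deliberately invalid UTF-8 byte smuggled into a string
df <- data.frame(
  id   = 1:2,
  note = c("plain ascii", rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9))))  # "caf" + 0xE9
)

clean <- df %>%
  mutate_if(is.character, ~ purrr::map_chr(.x, iconv, "UTF-8", "UTF-8", sub = "")) # just in case

validUTF8(clean$note)   # both TRUE now; the bad byte was silently dropped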

milesmcbain,
@milesmcbain@fosstodon.org avatar

@danielmoul @Danwwilson @hrbrmstr yeah, so the problem with this approach is that we’d rather avoid unprintable characters from a failed conversion, since some of that data may end up getting printed.

So to accurately convert the character you need to know the source encoding, but it can vary within a single column due to the client’s successive bad historical system migrations. There are tools to guess the source encoding but they can end up getting it wrong.
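(For reference, the guessers being referred to, e.g. {stringi}'s stri_enc_detect(), return ranked candidates with confidence scores rather than a definitive answer, and on short inputs the top candidate is often wrong; a tiny illustration:)

library(stringi)

# a short latin1 fragment: "caf" followed by the single byte 0xE9 for é
bytes <- as.raw(c(0x63, 0x61, 0x66, 0xe9))

# returns a data frame of candidate encodings with Language and Confidence columns;
# on something this short, latin1 may well not be the top guess
stri_enc_detect(bytes)[[1]]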

hrbrmstr,
@hrbrmstr@mastodon.social avatar

@milesmcbain @danielmoul @Danwwilson aye. that's why one of the examples uses the "guesser" approach first, but gosh it's horribad.

i'm kind of glad i don't have the "show this info to real humans" problems y'all do :-)

milesmcbain,
@milesmcbain@fosstodon.org avatar

@danielmoul @Danwwilson @hrbrmstr we used to be able to leave this data untouched and push blame back to the client (‘that’s the representation in your system’) if they saw anything weird, but now base R forces us to handle it, which means it’s now our fault if anything weird shows up. Quite inconvenient!

hrbrmstr,
@hrbrmstr@mastodon.social avatar

@milesmcbain @danielmoul @Danwwilson I mean…there's always Python…

/me ducks

Danwwilson,

@hrbrmstr @milesmcbain So I tried this out and it isn't pretty. That was saving back out to a file after using the last mapply() and then re-importing with readr::read_csv().

Danwwilson,

@hrbrmstr @milesmcbain Further to this I did try just importing that file with an encoding of latin1, which worked pretty nicely, but in a {targets} pipeline using {fst} it turns to crap again. Using straight iconv in the terminal does ok, but apostrophes end up as \u0092. It all just sucks 😡
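(For anyone reproducing this, the "import with an encoding of latin1" step can be declared directly in {readr}, so strings arrive as UTF-8 before they hit the {targets}/{fst} stage; the file name here is just a placeholder.)

library(readr)

# tell readr the source encoding up front; it re-encodes to UTF-8 on read
dat <- read_csv(
  "client_extract.csv",
  locale = locale(encoding = "latin1")
)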

hrbrmstr,
@hrbrmstr@mastodon.social avatar

@Danwwilson @milesmcbain 😩 I guess this reinforces why I'm glad I only need to deal with machines, network protocols, and stuff malicious folk drop vs have to process human input :-)

Rly bummed it didn't help.

You'd think there'd be a “solve" for this, given that we're in 2023 already. Gimme easy universal processing of human input over flying cars any day.

Danwwilson,

@hrbrmstr @milesmcbain so after more time than I cared to spend, my solution was to iconv in the terminal and post-process the remaining \u0092 characters with gsub and useBytes=TRUE. I had come across an SO response about using the file command with --mime-encoding to get a good guess at the encoding to use. So I’ll try this approach for now and see if it’s a pattern that is broadly reusable. But yeah, transferable data over a self-driving or flying car any day.
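(Roughly the shape of that workflow, driven from R via system2() so it stays in one script; raw.csv and clean.csv are placeholders, and file's guess can of course still be wrong.)

# ask `file` for its best guess at the encoding (-b drops the filename prefix)
guess <- system2("file", c("--mime-encoding", "-b", "raw.csv"), stdout = TRUE)

# convert the whole file with iconv, redirecting the output to a new file
system2("iconv", c("-f", guess, "-t", "UTF-8", "raw.csv"), stdout = "clean.csv")

# post-process any remaining \u0092 characters, as described above
txt <- readLines("clean.csv", encoding = "UTF-8", warn = FALSE)
txt <- gsub("\u0092", "'", txt, useBytes = TRUE)
writeLines(txt, "clean.csv", useBytes = TRUE)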

Danwwilson,

@hrbrmstr @milesmcbain So I've landed on a solution that seems to work for the most part. The biggest issue was \x92 (the Windows-1252 smart apostrophe you often see in web text) mixed in with mostly latin1 content that remained incorrectly encoded.

https://carbon.now.sh/9uSvFuGiF4uZKPOAVNlA

Things still aren't awesome but they are better.
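(One common way to deal with stray 0x92 bytes in otherwise latin1-ish text, sketched here with a made-up file name and not necessarily what the linked snippet does: decode from windows-1252 instead, since cp1252 agrees with latin1 for the usual accented characters but maps 0x92 to a real apostrophe, U+2019, rather than an invisible control character.)

library(stringi)

raw_bytes <- stri_read_raw("mixed.csv")

# decode as windows-1252 so 0x92 becomes a proper right single quote
txt <- stri_conv(raw_bytes, from = "windows-1252", to = "UTF-8")

# the text is now valid UTF-8, so it can go straight into readr
dat <- readr::read_csv(I(txt))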

milesmcbain,
@milesmcbain@fosstodon.org avatar

From the news file:

“Regular expression functions now check more thoroughly whether their inputs are valid strings (in their encoding, e.g. in UTF-8).”
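(To make the quoted change concrete, a small sketch: the string below carries a latin1 byte that is invalid as UTF-8; the exact error text varies, so it is only described in a comment.)

# a latin1-encoded "café": the trailing byte 0xE9 is not valid UTF-8
x <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))
validUTF8(x)    # FALSE

# under R >= 4.3 the regex functions validate their inputs, so something like
# grepl("caf", x) is expected to error instead of quietly soldiering on

# converting from the real source encoding (assumed latin1 here) clears it up
grepl("caf", iconv(x, from = "latin1", to = "UTF-8"))   # TRUE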

Mehrad,
@Mehrad@fosstodon.org avatar

@milesmcbain
Interesting, I have two projects on 4.3.1 and in neither of them have I faced any issues. One of them handles some web queries, and the other handles clinical data that is manually entered in Excel (🙄). So far everything seems smooth to me.
