milesmcbain,
@milesmcbain@fosstodon.org avatar

Is anyone else who’s taken 4.3 now spending inordinate amounts of time dealing with invalid character encodings? 😩😩😩

hrbrmstr,
@hrbrmstr@mastodon.social avatar

@milesmcbain None b/c I use {stringi} for like, everything.

Danwwilson,

@hrbrmstr @milesmcbain tell us more please. This has chewed hrs and hrs of my time over the last few weeks. The biggest struggle is mixed encodings in the same column in the same file. Any help is welcome.

hrbrmstr,
@hrbrmstr@mastodon.social avatar

@Danwwilson @milesmcbain A light example.

I'm processing some podcast transcripts that were AI transcribed. Whisper will often stick some odd encodings in the files (esp the JSON ones for some reason).

library(stringi)

PATH_TO_FILE_TO_CONVERT |>
  stri_read_raw() |>           # read the file as raw bytes, nothing decoded yet
  stri_conv(to = "UTF-8") |>   # convert from the current default encoding to UTF-8 in one go
  jsonlite::fromJSON()

gets me out of a world of trouble.

Now, if yours are truly “mixed” within the same column you may need to go line-by-line. (next post)

hrbrmstr,
@hrbrmstr@mastodon.social avatar

@Danwwilson @milesmcbain

(see attached or https://carbon.now.sh/xJFxG8bDTh69XG4wo8Ze)

that last chunk makes it easier to just work with the raw vector (save it back out, use {readr} functions directly, etc)
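(A rough sketch of the line-by-line idea described above, assuming a hypothetical messy.txt and leaning on {stringi}'s encoding guesser; the actual attached example may differ, and the guesser can get individual lines wrong, as discussed further down.)

library(stringi)

# read the whole file as raw bytes -- nothing is decoded yet
raw_bytes <- stri_read_raw("messy.txt")

# split the raw vector into per-line chunks on the newline byte (0x0a)
nl <- which(raw_bytes == as.raw(0x0a))
raw_lines <- split(raw_bytes, findInterval(seq_along(raw_bytes), nl + 1L))

# guess an encoding for each line's bytes, then convert that line to UTF-8
utf8_lines <- vapply(raw_lines, function(bytes) {
  guess <- stri_enc_detect(bytes)[[1]]$Encoding[1]   # best guess for this line
  stri_conv(bytes, from = guess, to = "UTF-8")
}, character(1), USE.NAMES = FALSE)

# each converted line still carries its own newline, so write them back as-is
writeLines(utf8_lines, "messy-utf8.txt", sep = "")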

hrbrmstr,
@hrbrmstr@mastodon.social avatar

@Danwwilson @milesmcbain by staying at the "raw" level, you bypass a ton of potential woes b/c nothing at the R level sees the "strings" until that final encoding.

Danwwilson,

@hrbrmstr @milesmcbain Thanks a bunch. I will have a go with this at a couple of files that have given me particular grief in recent times and let you know if {stringi} works as well for me as it does for you 😀

danielmoul,

@Danwwilson @hrbrmstr @milesmcbain similar to Bob I have often used this:

mutate_if(is.character, ~ purrr::map_chr(.x, iconv, "UTF-8", "UTF-8", sub="")) %>% # just in case

There are a lot more invisible bad character encodings out there than I expected.
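(A self-contained sketch of that fragment in context; the data frame, its note column, and the deliberately broken byte are all made up for illustration.)

library(dplyr)
library(purrr)

# toy data with one deliberately invalid UTF-8 byte smuggled into a string
df <- data.frame(
  id   = 1:2,
  note = c("plain ascii", rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9))))  # "caf" + 0xE9
)

clean <- df %>%
  mutate_if(is.character, ~ purrr::map_chr(.x, iconv, "UTF-8", "UTF-8", sub = "")) # just in case

validUTF8(clean$note)   # both TRUE now; the bad byte was silently dropped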

milesmcbain,
@milesmcbain@fosstodon.org avatar

@danielmoul @Danwwilson @hrbrmstr yeah, so the problem with this approach is that we’d rather avoid unprintable characters from a failed conversion, since some of that data may end up getting printed.

So to accurately convert the character you need to know the source encoding, but it can vary within a single column due to the client’s successive bad historical system migrations. There are tools to guess the source encoding but they can end up getting it wrong.
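(For reference, the guessers being referred to, e.g. {stringi}'s stri_enc_detect(), return ranked candidates with confidence scores rather than a definitive answer, and on short inputs the top candidate is often wrong; a tiny illustration:)

library(stringi)

# a short latin1 fragment: "caf" followed by the single byte 0xE9 for é
bytes <- as.raw(c(0x63, 0x61, 0x66, 0xe9))

# returns a data frame of candidate encodings with Language and Confidence columns;
# on something this short, latin1 may well not be the top guess
stri_enc_detect(bytes)[[1]]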

hrbrmstr,
@hrbrmstr@mastodon.social avatar

@milesmcbain @danielmoul @Danwwilson aye. that's why one of the examples uses the "guesser" approach first, but gosh it's horribad.

i'm kind of glad i don't have the "show this info to real humans" problems y'all do :-)

milesmcbain,
@milesmcbain@fosstodon.org avatar

@danielmoul @Danwwilson @hrbrmstr we used to be able to leave this data untouched and push blame back to the client (‘that’s the representation in your system’) if they saw anything weird, but now base R forces us to handle it, which means it’s now our fault if anything weird shows up. Quite inconvenient!

hrbrmstr,
@hrbrmstr@mastodon.social avatar

@milesmcbain @danielmoul @Danwwilson I mean…there's always Python…

/me ducks

Danwwilson,

@hrbrmstr @milesmcbain So I tried this out and it isn't pretty. That was saving back out to a file after using the last mapply() and then re-importing with readr::read_csv().

Danwwilson,

@hrbrmstr @milesmcbain Further to this I did try just importing that file with an encoding of latin1, which worked pretty nicely, but in a {targets} pipeline using {fst} it turns to crap again. Using straight iconv in the terminal does ok, but apostrophes end up as \u0092. It all just sucks 😡
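(For anyone reproducing this, the "import with an encoding of latin1" step can be declared directly in {readr}, so strings arrive as UTF-8 before they hit the {targets}/{fst} stage; the file name here is just a placeholder.)

library(readr)

# tell readr the source encoding up front; it re-encodes to UTF-8 on read
dat <- read_csv(
  "client_extract.csv",
  locale = locale(encoding = "latin1")
)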

hrbrmstr,
@hrbrmstr@mastodon.social avatar

@Danwwilson @milesmcbain 😩 I guess this reinforces why I'm glad I only need to deal with machines, network protocols, and stuff malicious folk drop vs have to process human input :-)

Rly bummed it didn't help.

You'd think there'd be a “solve" for this, given that we're in 2023 already. Gimme easy universal processing of human input over flying cars any day.

Danwwilson,

@hrbrmstr @milesmcbain so after more time than I cared to spend, my solution was to iconv in the terminal and post-process the remaining \u0092 characters with gsub and useBytes=TRUE. I had come across an SO response about using the file command with --mime-encoding to get a good guess at the encoding to use. So I’ll try this approach for now and see if it’s a pattern that is broadly reusable. But yeah, transferable data over a self-driving or flying car any day.
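(Roughly the shape of that workflow, driven from R via system2() so it stays in one script; raw.csv and clean.csv are placeholders, and file's guess can of course still be wrong.)

# ask `file` for its best guess at the encoding (-b drops the filename prefix)
guess <- system2("file", c("--mime-encoding", "-b", "raw.csv"), stdout = TRUE)

# convert the whole file with iconv, redirecting the output to a new file
system2("iconv", c("-f", guess, "-t", "UTF-8", "raw.csv"), stdout = "clean.csv")

# post-process any remaining \u0092 characters, as described above
txt <- readLines("clean.csv", encoding = "UTF-8", warn = FALSE)
txt <- gsub("\u0092", "'", txt, useBytes = TRUE)
writeLines(txt, "clean.csv", useBytes = TRUE)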

Danwwilson,

@hrbrmstr @milesmcbain So I've landed on a solution that seems to work for the most part. The biggest issue was \x92 (the Windows-1252 smart apostrophe you often see in web text) mixed in with mostly latin1 content that remained incorrectly encoded.

https://carbon.now.sh/9uSvFuGiF4uZKPOAVNlA

Things still aren't awesome but they are better.
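(One common way to deal with stray 0x92 bytes in otherwise latin1-ish text, sketched here with a made-up file name and not necessarily what the linked snippet does: decode from windows-1252 instead, since cp1252 agrees with latin1 for the usual accented characters but maps 0x92 to a real apostrophe, U+2019, rather than an invisible control character.)

library(stringi)

raw_bytes <- stri_read_raw("mixed.csv")

# decode as windows-1252 so 0x92 becomes a proper right single quote
txt <- stri_conv(raw_bytes, from = "windows-1252", to = "UTF-8")

# the text is now valid UTF-8, so it can go straight into readr
dat <- readr::read_csv(I(txt))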

milesmcbain,
@milesmcbain@fosstodon.org avatar

From the news file:

“Regular expression functions now check more thoroughly whether their inputs are valid strings (in their encoding, e.g. in UTF-8).”
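(To make the quoted change concrete, a small sketch: the string below carries a latin1 byte that is invalid as UTF-8; the exact error text varies, so it is only described in a comment.)

# a latin1-encoded "café": the trailing byte 0xE9 is not valid UTF-8
x <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))
validUTF8(x)    # FALSE

# under R >= 4.3 the regex functions validate their inputs, so something like
# grepl("caf", x) is expected to error instead of quietly soldiering on

# converting from the real source encoding (assumed latin1 here) clears it up
grepl("caf", iconv(x, from = "latin1", to = "UTF-8"))   # TRUE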

Mehrad,
@Mehrad@fosstodon.org avatar

@milesmcbain
Interesting, I have two projects on 4.3.1 and in neither of them have I faced any issues. One of them handles some web queries, and the other handles clinical data that is manually entered in Excel (🙄). So far everything seems smooth to me.
