Decompress Files node fails with accents

lsandinop · March 11, 2024, 7:43pm

Hello

I am unzipping a .zip file that contains files that may have accents in their names. When the files are written to the destination folder, the accented letter is changed to a question mark (?), and under these conditions the file cannot be copied and the node fails. A previous post (2021) says there is a ticket to fix this issue, but apparently it still persists.

I’m running KNIME AP 5.2.1

Any solution?

Thanks

mlauber71 · March 12, 2024, 6:53pm

@lsandinop what you could try is use the internal Python extension and unpack the file with that:

lsandinop · March 12, 2024, 7:39pm

Thank you very much. I appreciate the proposed solution; however, the environment where I need to implement it has a Python restriction.
Can you think of anything else?
Thanks

mlauber71 · March 12, 2024, 8:08pm

@lsandinop what operating system are you using? I think Windows does not allow ? in filenames.

You can just use the Python extension without the need to install a full Python version.

lsandinop · March 12, 2024, 8:27pm

I’m using Windows. I’ve been trying to do the extraction with a Java Snippet, but the library available for it is the filesystem one, which has the same restriction, so I think it is the same one the node uses. There is another one called Apache Commons Compress, but the package is not in the KNIME installation, so I’m trying to find out how to make it available.
Any other idea?

mlauber71 · March 12, 2024, 8:31pm

@lsandinop well, the next option might be to use R. But if you do not want to install the Python extension, you might not want to use R either …

lsandinop · March 12, 2024, 8:48pm

@mlauber71 it is the same problem, really. I usually have problems installing extensions with that client, so I generally have to solve everything with basic nodes.

lsandinop · March 12, 2024, 10:27pm

Hello to all of you! I found the solution.

The Apache Commons Compress library provides a functionality to decompress .zip files no matter what format the filenames are in. So, I downloaded the .jar library and added it from the Java Snippet Libraries tab, which enabled the use of the library in the node. Then I just did the code and it worked!!! I leave the code in case any of you find it useful!

import org.apache.commons.compress.archivers.ArchiveEntry;
import org.apache.commons.compress.archivers.zip.ZipArchiveInputStream;
import org.apache.commons.compress.utils.IOUtils;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;

// Path of the .zip file you want to unzip
String zipFilePath = "pathToZip/file.zip";

// Destination folder where files will be unzipped
String destDir = "PathToDestination/Destinationfolder/";

// out_catch and out_catch2 are presumably output fields defined in the Java Snippet node

// Create a ZipArchiveInputStream object to read the .zip archive
try (ZipArchiveInputStream zipIn = new ZipArchiveInputStream(new FileInputStream(zipFilePath))) {
    ArchiveEntry entry;

    // Iterate over each entry in the .zip file
    while ((entry = zipIn.getNextEntry()) != null) {
        String entryFileName = entry.getName();
        // Construct the full path to the file
        String entryPath = destDir + entryFileName;
        if (entry.isDirectory()) {
            // If the entry is a directory, create the directory
            File dir = new File(entryPath);
            dir.mkdirs();
        } else {
            // If the entry is a file, extract the file
            File outFile = new File(entryPath);
            // Create the necessary parent directories
            outFile.getParentFile().mkdirs();
            // Write the contents of the file
            try (FileOutputStream fos = new FileOutputStream(outFile)) {
                IOUtils.copy(zipIn, fos);
            } catch (Exception e) {
                // Report a failure to write this particular file
                out_catch = e.getMessage();
            }
        }
    }
} catch (Exception e) {
    // Report a failure to open or read the archive
    out_catch2 = e.getMessage();
}

mlauber71 · March 12, 2024, 10:32pm

@lsandinop can you put that in a sample workflow?

leo_woerteler · March 13, 2024, 4:23pm

@lsandinop I’m currently looking into a closely related problem where non-ASCII characters in file system paths don’t work properly. Could you execute the “Extract System Properties” node in your AP and report all properties containing encoding?

Thank you!

lsandinop · March 13, 2024, 5:32pm

Hi, is the same

I’ve been creating a workflow to upload here, and I saw that the error occurs when I zip the files with the default Windows zipper. I managed to unzip the files, but they lose the accents and the “ñ”.

I’ll upload it soon

lsandinop · March 13, 2024, 5:43pm

Hello

Here you can see a workflow with the Java Snippet code to decompress zip files containing files with special characters in their names.

While looking for the solution I realised that the problem is mainly in files compressed with the default Windows compressor. I tried the Decompress node with zips created with other tools and it extracts the files with accents and ñ perfectly. With the ones created by the Windows compressor I couldn’t do it with the node, so I had to adjust the Java Snippet to replace the special characters with “-”, and then I could write the files.
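The adjusted snippet itself is not shown in this post, so the following is only a minimal sketch of that kind of replacement in plain Java. The class name `EntryNameSanitizer`, the method `sanitize`, and the choice of which characters count as unsafe are hypothetical and are not taken from the uploaded workflow.

```java
import java.util.regex.Pattern;

public class EntryNameSanitizer {
    // Assumption: treat anything outside printable ASCII, plus characters
    // that Windows does not allow in file names, as unsafe.
    // Path separators (/ and \) are left alone so folders are preserved.
    private static final Pattern UNSAFE =
            Pattern.compile("[^\\x20-\\x7E]|[<>:\"|?*]");

    // Replace every unsafe character in a ZIP entry name with "-".
    static String sanitize(String entryName) {
        return UNSAFE.matcher(entryName).replaceAll("-");
    }

    public static void main(String[] args) {
        // \u00f1 is "ñ"; written as an escape to avoid source-encoding issues
        System.out.println(sanitize("docs/a\u00f1o_espa\u00f1ol.txt"));
    }
}
```

In a snippet like the one posted earlier in the thread, such a method would be applied to the entry name before building the output path. The accents are lost this way, so it is a fallback rather than a fix.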

I hope this is useful.

Regards

leo_woerteler · March 13, 2024, 8:31pm

Thanks for the workflow, that helped a lot! It turns out that the Windows ZIP tool is just very weird and uses strange encodings (different ones for different OS languages!) for the file names. There is no fool-proof way to determine the right one, although some tools seem to be much better than others. Here’s more info: GitHub - Dragon2fly/ZipUnicode: Extract zip file with correct encoding. Auto detect encoding for filename that was used to archive files. Fix zip file to use UTF-8 as filename encoding.
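To illustrate why the encoding matters, here is a small sketch that encodes a file name with CP437 and then decodes the same bytes once as CP437 and once as UTF-8. The class and method names are made up for illustration, and it assumes a JDK that includes the `IBM437` charset (full JDK distributions normally do).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class NameEncodingDemo {
    // Decode the raw bytes of a ZIP entry name with the given charset.
    static String decode(byte[] rawName, Charset charset) {
        return new String(rawName, charset);
    }

    public static void main(String[] args) {
        Charset cp437 = Charset.forName("IBM437");
        // \u00f1 is "ñ"; in CP437 it is stored as the single byte 0xA4
        byte[] raw = "a\u00f1o.txt".getBytes(cp437);

        // Decoding with the charset that was used for writing restores the name
        System.out.println(decode(raw, cp437));
        // 0xA4 on its own is not valid UTF-8, so the decoder substitutes
        // the replacement character U+FFFD for it
        System.out.println(decode(raw, StandardCharsets.UTF_8));
    }
}
```

The substituted character is a plausible origin of the unwritable names reported at the top of the thread, although the exact path through the node is not confirmed here.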

The Decompress Files node can be configured to either use a specific encoding or guess it from the extension. The default encoding for .zip is UTF-8, but your ZIP file uses CP437 to store the file names. So if you go to the “Encoding” tab in the node settings and set the encoding to “CP437”, your ZIP should extract fine.

I know this solution is not perfect (you have to know the encoding first), but it’s very hard to find a universal solution and UTF-8 works in most cases.
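When the encoding is not known in advance, one way to check a candidate is to list the entry names with that charset before extracting anything. The sketch below uses `java.util.zip.ZipInputStream` from the Java standard library rather than the Decompress Files node or Commons Compress; `ZipNamePreview` and `entryNames` are hypothetical names chosen for this example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class ZipNamePreview {
    // Read only the entry names of a ZIP stream, decoding them with the
    // given charset. No files are written.
    static List<String> entryNames(InputStream zipData, Charset nameCharset)
            throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zipIn = new ZipInputStream(zipData, nameCharset)) {
            ZipEntry entry;
            while ((entry = zipIn.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to the archive
        // args[1]: charset name to try, for example "IBM437" for CP437
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            entryNames(in, Charset.forName(args[1]))
                    .forEach(System.out::println);
        }
    }
}
```

If the names come out readable with `IBM437`, then CP437 is likely the setting to choose in the node's “Encoding” tab. Trying a charset that does not match may print garbled names or fail with an exception, depending on the charset and the JDK version.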

lsandinop · March 13, 2024, 8:56pm

Excellent!
Many thanks! I had tried all the other encoding options, and since they didn’t work I started looking for another solution.
Thanks again

izaychik63 · March 15, 2024, 7:51pm

You can also look at this product for batch solutions
