HugeFiles Overview

This documentation walks you through a typical use of the HugeFiles plugin for Notepad++.

Opening a file

The first step is to open a file. Select the Choose file option from the plugin's drop-down menu, or use the keyboard shortcut (Alt+Shift+F, unless it conflicts with another shortcut).

Once you open the file you can begin paging through the file.

Paging through the file

There are five ways to move through the file: forward by one chunk, back by one chunk, all the way to the last chunk, all the way to the first chunk, or directly to any chunk by clicking on it in the form.

The first time you begin paging, a new buffer will open containing the text in the selected chunk. Each time you move to a new chunk, the same buffer is overwritten by the chunk's text.

NOTE: If you want to page all the way to the end of a very large file (say, multiple GB), you probably want to change the settings to ignore the delimiter (leave delimiter blank, or set minChunk equal to maxChunk) as shown below. If you don't, the plugin will read a significant fraction of the file as it pages to the end.
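To illustrate why ignoring the delimiter helps, here is a minimal Python sketch. It is not the plugin's code, and the function name `last_chunk_start` is hypothetical. When every chunk has the same size, the start of the last chunk is plain arithmetic on the file length, so nothing in between has to be read. With a delimiter, each chunk's end depends on where the previous chunk ended, so the boundaries have to be found one after another.

```python
def last_chunk_start(file_length: int, chunk_size: int) -> int:
    """Offset of the final chunk when every chunk has exactly chunk_size units.

    Assumption: offsets and sizes are counted in the same unit
    (characters or bytes); the plugin's own unit may differ.
    """
    if file_length <= 0:
        return 0
    # Integer division gives the index of the last chunk; multiply it back
    # to get that chunk's starting offset.
    return (file_length - 1) // chunk_size * chunk_size


# With the default minChunk and maxChunk, a blank delimiter gives chunks of
# (180_000 + 220_000) // 2 = 200_000, so the jump is a single calculation.
default_size = (180_000 + 220_000) // 2
print(last_chunk_start(1_000_001, default_size))
```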

Using the form

The form (which you can open with Alt+Shift+H or the drop-down menu) lets you see how many chunks you've already opened, and also lets you open a new chunk by clicking. The eye icon indicates the chunk that is currently in view.

Changing the settings

There are 7 settings you can change:

  1. autoInferBestDelimiterAndTolerance (new in 0.2.0)
    • Default: true
    • If this is true, and the user has not manually set delimiter to empty or set minChunk equal to maxChunk, the plugin automatically does the following:
        • Determines which line separator the file uses, and uses that separator without changing the main settings.
        • Finds the length of the longest line in the first 8 KB of the file, and sets maxChunk - minChunk to 32 times that length.
    • If this setting causes long lines later in the file to be cut off improperly, turn it off.
  2. delimiter
    • Default: \r\n, the carriage return-linefeed that indicates a newline on Windows.
    • To see which newline a file uses, click on the paragraph symbol on the top menu bar in Notepad++.
    • If \r\n doesn't work for splitting lines, \r or \n might work.
    • If the delimiter is left blank, all the chunks will have size (minChunk + maxChunk) / 2. This will improve performance.
  3. minChunk and maxChunk
    • Default: 180,000 and 220,000 characters, respectively.
    • These are the minimum and maximum lengths that a chunk can be.
    • If minChunk equals maxChunk, the delimiter doesn't matter and all the chunks will have size (minChunk + maxChunk) / 2. This can also improve performance.
  4. previewLength
    • Default: 20 characters.
    • This is the size of the preview you get of each chunk.
    • If previewLength is 0, each chunk is labeled with the position in the document.
  5. parseJsonAsJson
    • Default: True
    • If true, files with the .json extension are automatically parsed as JSON. See below for more.
    • Parsing large JSON files can be quite slow (perhaps 100 milliseconds per megabyte) and will temporarily consume a lot of memory while the file is being parsed. Hopefully this upfront cost is justified because (a) the file doesn't stay in memory and (b) paging through the file is less likely to cause crazy lag.
  6. parseNonJsonAsJson
    • Default: False
    • If true, all files will be parsed as JSON.
    • If you want to chunk a file that doesn't have the .json extension as JSON, you should turn this setting on. Otherwise, it should be left off.

Changing any of these settings will cause you to lose any progress you made in paging through the document.
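The way delimiter, minChunk, maxChunk and autoInferBestDelimiterAndTolerance interact can be sketched in Python. This is one plausible reading of the settings described above, not the plugin's actual implementation. All function names are hypothetical, and the rule "end the chunk at the first delimiter found between minChunk and maxChunk characters" is an assumption.

```python
from typing import Iterator


def infer_delimiter_and_tolerance(sample: str) -> tuple[str, int]:
    """Guess the line separator and maxChunk - minChunk from a sample.

    `sample` stands in for the first 8 KB of the file.
    """
    delimiter = ""
    # Try "\r\n" first so a Windows file is not mistaken for "\r" or "\n".
    for candidate in ("\r\n", "\n", "\r"):
        if candidate in sample:
            delimiter = candidate
            break
    lines = sample.split(delimiter) if delimiter else [sample]
    longest = max(len(line) for line in lines)
    return delimiter, 32 * longest


def next_chunk_end(text: str, start: int, min_chunk: int,
                   max_chunk: int, delimiter: str) -> int:
    """End offset (exclusive) of the chunk that begins at `start`."""
    if not delimiter or min_chunk == max_chunk:
        # Delimiter ignored: fixed size of (minChunk + maxChunk) / 2.
        end = start + (min_chunk + max_chunk) // 2
    else:
        # Assumed rule: first delimiter lying wholly inside the allowed window.
        hit = text.find(delimiter, start + min_chunk, start + max_chunk)
        # No delimiter found: cut at maxChunk, possibly mid-line.
        end = start + max_chunk if hit == -1 else hit + len(delimiter)
    return min(end, len(text))


def chunk_offsets(text: str, min_chunk: int, max_chunk: int,
                  delimiter: str) -> Iterator[tuple[int, int]]:
    """Yield (start, end) offsets for every chunk of `text`."""
    if min_chunk <= 0 or max_chunk < min_chunk:
        raise ValueError("need 0 < min_chunk <= max_chunk")
    start = 0
    while start < len(text):
        end = next_chunk_end(text, start, min_chunk, max_chunk, delimiter)
        yield start, end
        start = end
```

For a file that uses `\r` newlines, calling `chunk_offsets` with `"\r"` ends every chunk right after a newline, while calling it with `"\r\n"` never finds the delimiter and falls back to cutting at `max_chunk`, which can land in the middle of a line.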

Chunking JSON files

Introduced in version 0.3.0

While it makes sense to break a text file like a big LOG or CSV into lines, this doesn't make much sense for JSON files, especially since they are frequently one-line documents.

This plugin can temporarily read a large JSON file into memory and parse it to find the best places to divide the file up into chunks such that each chunk is syntactically valid JSON. This process is slow, but hopefully the results are worth it.
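As a rough illustration of what "each chunk is syntactically valid JSON" means, the Python sketch below splits a top-level JSON array into smaller arrays. It is not the plugin's algorithm: the function name `chunk_json_array` is hypothetical, only a top-level array is handled, and the elements are re-serialized rather than sliced out of the original text.

```python
import json


def chunk_json_array(text: str, max_chunk: int) -> list[str]:
    """Split a JSON array into smaller JSON arrays of about max_chunk characters."""
    items = json.loads(text)
    if not isinstance(items, list) or not items:
        return [text]  # this sketch leaves anything else as a single chunk
    chunks: list[str] = []
    current: list[str] = []
    size = 2  # the surrounding "[" and "]"
    for item in items:
        encoded = json.dumps(item)
        # Start a new chunk if adding this element (plus ", ") would overflow.
        if current and size + len(encoded) + 2 > max_chunk:
            chunks.append("[" + ", ".join(current) + "]")
            current, size = [], 2
        current.append(encoded)
        size += len(encoded) + 2
    chunks.append("[" + ", ".join(current) + "]")
    return chunks


for chunk in chunk_json_array("[1, 2, 3, 4, 5, 6]", 8):
    print(chunk, "->", json.loads(chunk))
```

Because every chunk is itself a complete array, any one of them can be handed to a JSON parser on its own. A single element longer than `max_chunk` still becomes its own oversized chunk in this sketch.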

To be chunked by this plugin, the JSON file must conform exactly to the original JSON specification. That means no commas after the last element of an array or object, no leading decimal points, no single-quoted strings, no NaN, etc. The one exception (because it's actually faster to relax this requirement) is that a fraction (number, then /, then another number) can have any kind of number (including floating point) as the numerator or denominator.
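If you want to check a file against these rules before opening it, a small Python sketch like the following may help. It uses Python's standard `json` module, not the plugin's parser, and the helper names `is_strict_json` and `_reject_constant` are hypothetical. Python accepts NaN and Infinity by default, so the sketch rejects them through `parse_constant`. Note that Python does not accept the fraction syntax mentioned above, so this check is stricter than the plugin on that one point.

```python
import json


def _reject_constant(name: str):
    """Called by json.loads for NaN, Infinity and -Infinity."""
    raise ValueError(f"{name} is not allowed in strict JSON")


def is_strict_json(text: str) -> bool:
    """Return True if `text` parses as JSON without NaN or Infinity."""
    try:
        json.loads(text, parse_constant=_reject_constant)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True


for sample in ('[1, 2]', '[1, 2,]', "{'a': 1}", '[.5]', '[NaN]'):
    print(repr(sample), is_strict_json(sample))
```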

Here's an example of a JSON file parsed using the JSON chunking functionality.

JSON file chunked as JSON

My JsonTools plugin is able to work with any chunk of this file and create a tree view.

JSON chunk with JsonTools tree

Example of the impact of settings

A CSV file with default settings. Notice that the chunk boundary is mid-line because this file does not use \r\n as newlines.

CSV file with default settings and divided line

Clicking on the paragraph icon in the menu bar reveals that this file has Macintosh \r as newline (rendered as a black CR as opposed to LF or CR LF), so we open the settings and change our delimiter to \r.

You may notice that Notepad++ misreports the newline as Unix (LF) in the bottom right corner. That's why it's important to double-check using the paragraph icon.

See newlines and change delimiter in settings

Now when we try paging through the file, we see that the chunks are split on line boundaries.

CSV file with CR delimiter and line-split chunks

Let's look at the file one last time, this time with previews.

CSV file with CR delimiter and 30-character preview

Text search form

You can use a form to search for text in the huge file that you've chosen. This form will find matches for simple text or a regular expression.

The search form caps out at 100 search results per chunk. This is in place to avoid excessive memory consumption when searching very large files.
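The per-chunk cap can be pictured with this minimal Python sketch. It is not the plugin's code: the names `search_chunk` and `MAX_RESULTS_PER_CHUNK` are hypothetical, and Python's `re` syntax may differ from the regular expression flavor the plugin uses.

```python
import re
from itertools import islice

MAX_RESULTS_PER_CHUNK = 100  # the cap described above


def search_chunk(chunk_text: str, query: str, use_regex: bool = False,
                 limit: int = MAX_RESULTS_PER_CHUNK) -> list[tuple[int, str]]:
    """Return up to `limit` (position, matched text) pairs for one chunk."""
    # Plain-text queries are escaped so characters like "." match literally.
    pattern = re.compile(query if use_regex else re.escape(query))
    # islice stops consuming matches once the cap is reached, so memory use
    # stays bounded even if the chunk contains many more hits.
    return [(m.start(), m.group())
            for m in islice(pattern.finditer(chunk_text), limit)]


print(len(search_chunk("a" * 250, "a")))  # capped at MAX_RESULTS_PER_CHUNK
```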

Example usage of search form

Clicking on a top-level node in the treeview (looks like 200053: 60 results) causes Notepad++ to open up that chunk in a buffer.

This search form works whether the file was chunked as JSON or as text.

Running the tests

This plugin has automated tests that can be run at will. You can also see the most recent test results in most_recent_errors.txt in this repository.

If you try to run the tests in a very old version of Notepad++ (older than 8.0, I think) that does not put each plugin into its own subfolder of the plugins folder, the tests will fail with an error because the paths to the files they read won't exist.