Sorting a very big text file is a problem of memory resources. If all the data could be loaded into memory, it would be an easy task: simply sort the data in memory and write it to a new file. But sometimes we need to sort a file so big that the physical memory is not enough.
I once had to write such a large-file sorter and had a hard time finding anything ready-made on the web, so I wrote a sorter of my own. The logic behind sorting a very large file is not hard to understand once you get the idea. In general, the steps are:
Phase 1: Split the big file into several small sorted files
- Read n lines from the file into memory.
- Sort the lines in memory.
- Write the sorted lines to a temporary file.
- Repeat from step 1, reading the next n lines and writing them to a new temporary file, until the whole input file has been read (a simplified sketch of this phase follows the list).
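To make the split phase concrete, here is a minimal standalone sketch of Phase 1. The file names, the chunk size, and the use of natural string ordering are assumptions for illustration only; the full, configurable implementation appears further below.

import java.io.*;
import java.util.*;

public class SplitPhaseSketch
{
    private static final int CHUNK_SIZE = 1000; // lines held in memory at once (assumed)

    public static void main(String[] args) throws IOException
    {
        try (BufferedReader reader = new BufferedReader(new FileReader("bigfile.txt")))
        {
            List<String> chunk = new ArrayList<String>();
            int chunkIndex = 0;
            String line;
            while ((line = reader.readLine()) != null)
            {
                chunk.add(line);
                if (chunk.size() == CHUNK_SIZE)
                {
                    writeSortedChunk(chunk, chunkIndex++);
                    chunk.clear();
                }
            }
            if (!chunk.isEmpty())
            {
                writeSortedChunk(chunk, chunkIndex); // last, possibly partial chunk
            }
        }
    }

    // Sorts one chunk in memory and writes it to its own temporary file
    private static void writeSortedChunk(List<String> chunk, int index) throws IOException
    {
        Collections.sort(chunk);
        try (BufferedWriter writer = new BufferedWriter(new FileWriter("bigfile.txt.tmp" + index)))
        {
            for (String l : chunk)
            {
                writer.write(l);
                writer.newLine();
            }
        }
    }
}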
Phase 2: Merge the smaller files into a new sorted big file
- Find the smallest line among the current lines of all files.
- Write that line to a new file (this file is the merged output).
- Advance to the next line in the file that had the smallest line.
- Repeat from step 1 until all lines in all files have been read (a sketch of this phase follows the list).
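Here is a similar standalone sketch of Phase 2. Instead of the linear scan used in the full implementation below, it keeps the open files in a java.util.PriorityQueue ordered by their current line, which is just another way to express the same merge; the file names and the number of temporary files are assumed for illustration.

import java.io.*;
import java.util.*;

public class MergePhaseSketch
{
    // Holds one temporary file's reader and its current (not yet written) line
    private static final class Source
    {
        final BufferedReader reader;
        String line;

        Source(String filename) throws IOException
        {
            reader = new BufferedReader(new FileReader(filename));
            line = reader.readLine();
        }
    }

    public static void main(String[] args) throws IOException
    {
        int numChunks = 3; // assumed number of temp files produced by Phase 1
        PriorityQueue<Source> queue = new PriorityQueue<Source>(Comparator.comparing((Source s) -> s.line));
        for (int i = 0; i < numChunks; i++)
        {
            Source source = new Source("bigfile.txt.tmp" + i);
            if (source.line != null) // skip empty temp files
            {
                queue.add(source);
            }
            else
            {
                source.reader.close();
            }
        }
        try (BufferedWriter writer = new BufferedWriter(new FileWriter("bigfile_sorted.txt")))
        {
            while (!queue.isEmpty())
            {
                Source smallest = queue.poll(); // source holding the smallest current line
                writer.write(smallest.line);
                writer.newLine();
                smallest.line = smallest.reader.readLine(); // advance that source
                if (smallest.line != null)
                {
                    queue.add(smallest); // re-insert with its new current line
                }
                else
                {
                    smallest.reader.close(); // this temp file is exhausted
                }
            }
        }
    }
}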
Here is the code for such a file sorter:
import java.io.*;
import java.util.*;

public class FileSort
{
    protected String filenameToSort;
    protected String filenameSorted;
    protected Comparator<String> comparator;
    protected int maxCapacity;

    public FileSort(String filenameToSort, String filenameSorted, Comparator<String> comparator, int maxCapacity)
    {
        this.filenameToSort = filenameToSort;
        this.filenameSorted = filenameSorted;
        this.comparator = comparator;
        this.maxCapacity = maxCapacity;
    }

    // Phase 1: split the input file into sorted temp files, then merge them
    public void sort() throws IOException
    {
        BufferedReader bufferedReader = new BufferedReader(new FileReader(filenameToSort));
        String line = null;
        int fileIndex = 0;
        BufferedWriter bufferedWriter;
        do
        {
            // Read up to maxCapacity lines into memory
            List<String> lines = new ArrayList<String>(maxCapacity);
            for (int i = 0; i < maxCapacity; i++)
            {
                line = bufferedReader.readLine();
                if (line == null)
                {
                    break;
                }
                else
                {
                    lines.add(line);
                }
            }
            // Sort the chunk and write it to its own temp file
            Collections.sort(lines, comparator);
            bufferedWriter = new BufferedWriter(new FileWriter(filenameToSort + ".tmp" + fileIndex++));
            for (String _line : lines)
            {
                bufferedWriter.write(_line);
                bufferedWriter.newLine();
            }
            bufferedWriter.flush();
            bufferedWriter.close();
        }
        while (line != null);
        bufferedReader.close();
        mergeFiles(fileIndex);
    }

    // Phase 2: merge the sorted temp files into the final sorted file
    public void mergeFiles(int numFiles) throws IOException
    {
        BufferedWriter bufferedWriter = new BufferedWriter(new FileWriter(filenameSorted));
        List<MergeFile> mergeFiles = new ArrayList<MergeFile>();
        for (int i = 0; i < numFiles; i++)
        {
            mergeFiles.add(new MergeFile(filenameToSort + ".tmp" + i));
        }
        do
        {
            // Find smallest line
            Iterator<MergeFile> iterator = mergeFiles.iterator();
            MergeFile minMergeFile = null;
            while (iterator.hasNext())
            {
                MergeFile mergeFile = iterator.next();
                if (mergeFile.line != null) // File has more lines
                {
                    if (minMergeFile == null || comparator.compare(mergeFile.line, minMergeFile.line) < 0)
                    {
                        minMergeFile = mergeFile;
                    }
                }
                else // No more lines in file. No need to iterate it
                {
                    mergeFile.removeFile();
                    iterator.remove();
                }
            }
            // Write smallest line to file
            if (minMergeFile != null)
            {
                bufferedWriter.write(minMergeFile.line);
                bufferedWriter.newLine();
                minMergeFile.readLine();
            }
        }
        while (mergeFiles.size() > 0); // As long as there are files to read
        bufferedWriter.flush();
        bufferedWriter.close();
    }

    // Wraps one temp file: keeps its reader and the current (not yet written) line
    private final class MergeFile
    {
        BufferedReader bufferedReader;
        String filename;
        String line;
        boolean isReadNextRow = true;

        public MergeFile(String filename) throws IOException
        {
            this.filename = filename;
            bufferedReader = new BufferedReader(new FileReader(filename));
            readLine();
        }

        public void readLine() throws IOException
        {
            isReadNextRow = (line = bufferedReader.readLine()) != null;
        }

        public void removeFile() throws IOException
        {
            bufferedReader.close();
            new File(filename).delete();
        }
    }
}
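For completeness, here is how FileSort itself could be used on a single input file; the file names and the in-memory capacity of 1000 lines are placeholders, not values from a real setup.

import java.io.IOException;
import java.util.Comparator;

public class FileSortExample
{
    public static void main(String[] args) throws IOException
    {
        // Sort big_input.txt into big_input_sorted.txt, holding at most 1000 lines in memory
        FileSort fileSort = new FileSort("big_input.txt", "big_input_sorted.txt",
                new Comparator<String>()
                {
                    public int compare(String o1, String o2)
                    {
                        return o1.compareTo(o2);
                    }
                }, 1000);
        fileSort.sort();
    }
}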
A few things to note about this file sorter:
- It does not limit the number of simultaneously open files. This shouldn't be much of a problem, since the open-files limit is usually high and can be raised (at least on Linux machines). If you do hit such a limit, you can always merge a group of files at a time (the group size being the maximum allowed number of open files), and then merge the resulting files into a single file (see the sketch after these notes).
- This file sorter does not define the logic by which lines are compared; that must be supplied via a Comparator.
- The maximum number of lines to read into memory at a time is also passed as a parameter to the sorter.
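For the first note above, a grouped merge could be driven by a small multi-pass loop like the following sketch. It is not part of the sorter itself; the class name, the group size, and the mergeSortedGroup helper (which repeats the Phase 2 merge with natural string ordering) are assumptions for illustration.

import java.io.*;
import java.util.*;

public class GroupedMergeSketch
{
    private static final int MAX_OPEN_FILES = 100; // assumed per-process budget for the merge

    // Repeatedly merges groups of at most MAX_OPEN_FILES sorted files until only the
    // final output remains. Assumes at least two input files; intermediate pass files
    // are left in place for simplicity and could be deleted after each pass.
    public static void mergeInPasses(List<String> files, String finalOutput) throws IOException
    {
        int pass = 0;
        while (files.size() > 1)
        {
            List<String> nextPass = new ArrayList<String>();
            for (int i = 0; i < files.size(); i += MAX_OPEN_FILES)
            {
                List<String> group = files.subList(i, Math.min(i + MAX_OPEN_FILES, files.size()));
                String output = (files.size() <= MAX_OPEN_FILES)
                        ? finalOutput
                        : finalOutput + ".pass" + pass + "_" + nextPass.size();
                mergeSortedGroup(group, output);
                nextPass.add(output);
            }
            files = nextPass;
            pass++;
        }
    }

    // Linear-scan k-way merge of already sorted files, as in Phase 2 above
    private static void mergeSortedGroup(List<String> inputs, String output) throws IOException
    {
        List<BufferedReader> readers = new ArrayList<BufferedReader>();
        List<String> current = new ArrayList<String>();
        for (String f : inputs)
        {
            BufferedReader r = new BufferedReader(new FileReader(f));
            readers.add(r);
            current.add(r.readLine());
        }
        try (BufferedWriter writer = new BufferedWriter(new FileWriter(output)))
        {
            while (true)
            {
                int min = -1;
                for (int i = 0; i < current.size(); i++)
                {
                    if (current.get(i) != null
                            && (min == -1 || current.get(i).compareTo(current.get(min)) < 0))
                    {
                        min = i;
                    }
                }
                if (min == -1)
                {
                    break; // all inputs exhausted
                }
                writer.write(current.get(min));
                writer.newLine();
                current.set(min, readers.get(min).readLine());
            }
        }
        for (BufferedReader r : readers)
        {
            r.close();
        }
    }
}

Sometimes the data to sort is spread across several input files rather than one. The following MultipleFileSort class extends FileSort for that case: it first concatenates all the input files into a single temporary file and then sorts that file using the base class.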
import java.io.*;
import java.util.*;

public class MultipleFileSort extends FileSort
{
    private static final int BUFFER_SIZE = 1024 * 4;
    protected String[] filesToSort;

    public MultipleFileSort(String[] filesToSort, String filenameSorted, Comparator<String> comparator, int maxReadLines)
    {
        super(filenameSorted + ".tmp", filenameSorted, comparator, maxReadLines);
        this.filesToSort = filesToSort;
    }

    // Concatenates the input files into a single file (plain byte copy). Note: this
    // assumes each input file ends with a newline; otherwise the last line of one file
    // and the first line of the next would be fused.
    public static void mergeFiles(String[] files, String filenameMerged) throws IOException
    {
        // Open all files for reading
        InputStream[] inputs = new InputStream[files.length];
        OutputStream os = null;
        try
        {
            for (int i = 0; i < files.length; i++)
            {
                inputs[i] = new FileInputStream(files[i]);
            }
            // Open file for writing
            os = new FileOutputStream(filenameMerged);
            byte[] buffer = new byte[BUFFER_SIZE];
            int len;
            for (InputStream is : inputs)
            {
                while ((len = is.read(buffer)) > 0)
                {
                    os.write(buffer, 0, len);
                }
            }
        }
        finally
        {
            for (InputStream is : inputs)
            {
                close(is);
            }
            close(os);
        }
    }

    // Quietly close a stream (it may be null if opening it failed)
    private static void close(Closeable closeable)
    {
        if (closeable != null)
        {
            try
            {
                closeable.close();
            }
            catch (IOException e)
            {
                // Ignore close failures
            }
        }
    }

    public void sort() throws IOException
    {
        // Concatenate all input files into one temp file, then sort it as a single file
        mergeFiles(filesToSort, filenameToSort);
        super.sort();
        new File(filenameToSort).delete();
    }
}
And finally, an example of using the MultipleFileSort class:
public static void main(String[] args)
{
    try
    {
        MultipleFileSort multipleFileSort = new MultipleFileSort(
                new String[]{"in_file1.txt", "in_file2.txt", "in_file3.txt"},
                "out_sorted.txt",
                new Comparator<String>()
                {
                    public int compare(String o1, String o2)
                    {
                        return o1.compareTo(o2);
                    }
                },
                1000);
        multipleFileSort.sort();
    }
    catch (Exception e)
    {
        e.printStackTrace();
    }
}