In addition to the pipe design, I am interested in two other features of Stream: lazy evaluation and parallel streams. With lazy evaluation, a computation is delayed until its result is actually needed, and a parallel stream can execute the entire pipe in parallel. Most importantly, these two features are optimized for the JVM and should be more efficient than our own implementation.
When I read about lazy evaluation, I thought it could benefit loading large files, e.g., by reducing memory consumption. However, I was not sure about that, so I designed an experiment to confirm the idea. First, an interface FileSearchStrategy is created (Code List 1); its method accepts a file (or folder), a keyword, and a result collector (SearchResultCollector). Each implementation can use a different methodology to search for the keyword in a file and put each result, a tuple <filename, line number, line content>, into the collector.
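Code List 1 itself is not reproduced here, so the following is only a minimal sketch of what the interface and its companions might look like; the exact method and field names (add, results, ListResultCollector) are assumptions, not the original code.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Collector that strategy implementations feed their hits into.
interface SearchResultCollector {
    void add(KeywordSearchResult result);
}

interface FileSearchStrategy {
    // 'file' may point at a regular file or at a folder.
    void search(File file, String keyword, SearchResultCollector collector);
}

// One hit: the tuple <filename, line number, line content>.
class KeywordSearchResult {
    final String filename;
    final int lineNumber;
    final String line;

    KeywordSearchResult(String filename, int lineNumber, String line) {
        this.filename = filename;
        this.lineNumber = lineNumber;
        this.line = line;
    }
}

// Simple list-backed collector, enough for the experiments.
class ListResultCollector implements SearchResultCollector {
    final List<KeywordSearchResult> results = new ArrayList<>();

    @Override
    public void add(KeywordSearchResult result) {
        results.add(result);
    }
}
```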
A File object in Java can point to either a file or a folder. Thus, AbstractSearchStrategy (Code List 2) provides an implementation of the search(File, String, SearchResultCollector) method of FileSearchStrategy that traverses each folder recursively. A hook method is declared for concrete classes to scan the content of a regular file.
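Since Code List 2 is not shown here, this is a hedged sketch of what the recursive traversal might look like. To keep it self-contained, the hook is modeled as a BiConsumer rather than the abstract method the real class declares, and the class name is a placeholder.

```java
import java.io.File;
import java.util.function.BiConsumer;

// Recursive directory walk: folders recurse, regular files go to the hook.
class TraversalSketch {
    static void traverse(File file, String keyword, BiConsumer<File, String> scanFileHook) {
        if (file.isDirectory()) {
            File[] children = file.listFiles();
            if (children != null) {
                for (File child : children) {
                    traverse(child, keyword, scanFileHook);   // recurse into sub-folders
                }
            }
        } else {
            scanFileHook.accept(file, keyword);               // concrete strategies scan here
        }
    }
}
```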
With the infrastructure completed, the first strategy implementation, DefaultSearchStrategy (Code List 3), uses the algorithm commonly applied to text files before the Stream API existed: scan line by line. The result of this strategy is also the baseline of the experiment.
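Code List 3 is not reproduced in this chunk; the core of the baseline hook is presumably a classic readLine() loop like the sketch below, which keeps only one line in memory at a time. The class and method names are placeholders.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

// Baseline scan: read line by line, count (or collect) matching lines.
class LineByLineScanSketch {
    static int countMatches(Reader source, String keyword) throws IOException {
        int hits = 0;
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.contains(keyword)) {
                    hits++;   // the real strategy puts a KeywordSearchResult into the collector
                }
            }
        }
        return hits;
    }
}
```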
Next, the AllLinesSearchStrategy implementation uses the readAllLines(Path) method of the Files class, which came with the NIO.2 (New I/O 2) package in Java 7 (Code List 4). In fact, the Javadoc says it is a convenience method for reading all lines in a single operation and is not intended for reading large files. Therefore, this strategy should produce the worst result in the experiment.
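Code List 4 is not shown here; a minimal sketch of a readAllLines-based hook could look like this. The whole file is materialized as a List<String> before scanning, which is exactly why large files inflate memory usage. The names are placeholders.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

class AllLinesScanSketch {
    static int countMatches(Path file, String keyword) throws IOException {
        List<String> lines = Files.readAllLines(file);  // loads the entire file into memory
        int hits = 0;
        for (String line : lines) {
            if (line.contains(keyword)) {
                hits++;
            }
        }
        return hits;
    }
}
```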
For convenience in the later parallel stream experiment, the StreamSearchStrategy implementation puts the pipe combination into a separate method (Code List 5): scanStream(Stream, String, SearchResultCollector). It then uses the lines() method of the BufferedReader class to obtain the Stream as input. In order to keep the line number, the intermediate operation map(Function) transforms each string into an object consisting of a line number and a string (reusing KeywordSearchResult for simplicity), and filter(Predicate) then drops the objects that do not match.
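The map-then-filter pipe described above might be sketched as follows (Code List 5 itself is not shown, and SimpleEntry stands in for KeywordSearchResult to keep the sketch self-contained). Note the counter that attaches line numbers is a stateful lambda, which is only safe for a sequential stream; this matters in the parallel experiments later.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.AbstractMap.SimpleEntry;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class StreamScanSketch {
    static List<SimpleEntry<Integer, String>> scanStream(Stream<String> lines, String keyword) {
        AtomicInteger lineNumber = new AtomicInteger();
        return lines
                // map(Function): attach a line number to every line
                .map(line -> new SimpleEntry<>(lineNumber.incrementAndGet(), line))
                // filter(Predicate): drop entries whose line does not match
                .filter(entry -> entry.getValue().contains(keyword))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        BufferedReader reader = new BufferedReader(new StringReader("foo\nbar\nfoo bar"));
        // reader.lines() is the lazily evaluated Stream<String> input.
        scanStream(reader.lines(), "bar")
                .forEach(e -> System.out.println(e.getKey() + ": " + e.getValue()));
        // prints "2: bar" then "3: foo bar"
    }
}
```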
The scanStream(Stream, String, SearchResultCollector) method of the StreamSearchStrategy class needs the map(Function) step only because the line number is required. If the line number can be ignored, filter(Predicate) alone is enough, as the StreamSearchStrategyV2 class in Code List 6 shows. Does that affect the result?
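A sketch of the v2 pipe (Code List 6 is not shown here; the class name below is a placeholder): with no line numbers to carry, there is no per-line wrapper object to allocate, only the filter.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class StreamScanV2Sketch {
    static List<String> scanStream(Stream<String> lines, String keyword) {
        return lines
                .filter(line -> line.contains(keyword))   // keep matching lines only
                .collect(Collectors.toList());
    }
}
```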
System memory has grown quickly in size, and the operating system usually caches file content, so reading the same file again is sped up. That acceleration would skew the measurements. To avoid reading file content from the cache, the experiment prepares three folders, each containing the same file structure and content listed in Table 1. For example, sub-folder A has 10 files with 100k records each (3.62 MB per file), and sub-folders B, C, etc. follow the same pattern. Sub-folder G holds a copy of sub-folders A ~ F; as a result, each of the three folders has 120 files, totaling 4.45 GB.
Table 1 - Test data

| Sub Folder | Records | File Amount | Size/File (MB) |
| --- | --- | --- | --- |
| A | 100k | 10 | 3.62 |
| G | A + B + C + D + E + F | | |
However, the success of this method depends on the size of the system memory. The experiment environment is listed in Table 2, and the test process is shown in Code List 7. Each strategy searches for the keyword in the three prepared folders in order. By the time the second strategy starts to search the first folder, the content of the other two folders should have been loaded into memory by the first strategy. Their total size is 8.9 GB, which is larger than the system memory, so the content of the first folder is no longer in the cache. A MemoryUsageMonitor object monitors the memory usage of the JVM periodically and records the peak value during the experiment.
Table 2 - Test environment

| Item | Specification |
| --- | --- |
| CPU | Intel Core i5-2400 |
| HDD | Seagate ST3160815AS 160GB |
| OS | Windows 7 SP1 |
Well, it is time to show the result! See Table 3. As expected, the All Lines strategy uses the most memory (over 1 GB) and takes the longest time (20 seconds longer than the other strategies). The differences between the Default, Stream, and Stream v2 strategies are not significant (about 3 seconds). The memory usage of the Stream v2 and Default strategies is almost the same, but the map(Function) step appears to carry a noticeable cost.
Table 3 - Test result of four strategies

| Strategy | Time (ms) | Memory (MB) |
| --- | --- | --- |
AbstractSearchStrategy uses a traditional for-each loop to traverse the files. Would a parallel stream help, or make things worse? To find out, the implementation of the search(File, String, SearchResultCollector) method of the AbstractSearchStrategy class is changed to Code List 8, and the experiment is run again.
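Since Code List 8 is not shown here, the change is presumably along these lines: the for-each over child files becomes a parallel stream, which means the result sink must now be thread-safe. In this self-contained sketch the names are placeholders, and matching is done on file names rather than file content purely to keep it short.

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;

class ParallelTraversalSketch {
    // 'matchedFiles' must be a thread-safe list, e.g. CopyOnWriteArrayList.
    static void search(File file, String keyword, List<String> matchedFiles) {
        if (file.isDirectory()) {
            File[] children = file.listFiles();
            if (children != null) {
                Arrays.stream(children)
                      .parallel()                       // traverse siblings concurrently
                      .forEach(child -> search(child, keyword, matchedFiles));
            }
        } else if (file.getName().contains(keyword)) {  // stand-in for scanning the content
            matchedFiles.add(file.getName());
        }
    }
}
```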
I supposed the parallel stream would speed up the search. However, as Table 4 shows, it does not; it actually makes things worse, with huge memory consumption. I am surprised that the memory usage of the Stream v2 strategy is higher than that of the Stream strategy, and I do not know how to explain this phenomenon.
Table 4 - Test result of four strategies with parallel directory traversal

| Strategy | Time (ms) | Memory (MB) |
| --- | --- | --- |
From these results, when I/O is involved, a pipe with Stream does not run faster or use less memory, even with a parallel stream; sometimes it is slower. If the data is not in files on the hard disk, what is the effect on data already in memory? A third experiment is designed for this. Suppose 100k ~ 6400k records are kept in an ArrayList; based on a parameter (Code List 9), the stream() method is used to obtain the input of scanStream(Stream, String, SearchResultCollector).
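Code List 9 is not reproduced here; the setup might look like the sketch below, where a boolean flag (an assumption about how the parameter is handled) switches between stream() and parallelStream() over the in-memory records.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

class InMemoryStreamSketch {
    static List<String> search(ArrayList<String> records, String keyword, boolean parallel) {
        // The records are already in memory; only the stream source differs.
        return (parallel ? records.parallelStream() : records.stream())
                .filter(line -> line.contains(keyword))
                .collect(Collectors.toList());
    }
}
```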
The result is listed in Table 5. Note that in the 100k column, whichever of the Stream and Stream v2 strategies runs first always gets a bad result; the cause may be the cold start of the program, so the 100k column is ignored. From 200k onward, both the Stream and Stream v2 strategies benefit from parallelStream(), which reduces the execution time considerably. In the 6400k column, Stream v2 with a parallel stream saves 136 ms. In general, the Stream v2 strategy performs better than the Stream strategy.
Table 5 - The execution time (ms) with parallel stream
It is time to conclude. First, to benefit from lazy evaluation, the pipe design must be optimized; a bad pipe design makes performance worse, and I/O gains little from lazy evaluation. Second, a parallel stream can speed up processing only for some kinds of data sources: an I/O-bound source, or one whose access may involve a race condition, is not sped up, while data that is already in memory and can be accessed without locking can be processed much faster. However, a parallel stream also increases memory usage, so it should be used carefully.
P.S. The source code is still being organized. Once that is done, it will be published on GitHub.