I’ve been meaning to spend more time with the new Java 8 features, so I decided to start blogging about them. I like to work from at least a pseudo-real-world case, so I started by looking at streams for reading files. I also wanted more of an endpoint, so I picked a project: take a list of US cities, states, and zip codes and determine which city name is used in the most states. (The input file came from GeoNames, and I converted the .csv into a tab-delimited file for easier processing.)
The code is at https://github.com/bselack/java8-streams.
Comparing a standard way of reading files with streams, the run took about 575 ms using Scanner and about 440 ms (or less) using streams.
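    // needs java.util.stream.Stream in addition to java.nio.file.Files and Paths
    private static void useStreams(String filePath) throws IOException {
        long startTime = System.currentTimeMillis();

        // try-with-resources closes the underlying file once the stream is consumed
        try (Stream<String> lines = Files.lines(Paths.get(filePath))) {
            lines.map(line -> line.split("\\t"))
                 .forEach(fields -> addStateToCity(fields[3], fields[4]));
        }

        System.out.println("Streams took: " + (System.currentTimeMillis() - startTime) + " ms.");
    }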
    private static void useScanner(String filePath) {
        long startTime = System.currentTimeMillis();

        try (Scanner scanner = new Scanner(Paths.get(filePath))) {
            while (scanner.hasNext()) {
                String[] lineParsed = scanner.nextLine().split("\\t");
                addStateToCity(lineParsed[3], lineParsed[4]);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }

        System.out.println("Scanner took: " + (System.currentTimeMillis() - startTime) + " ms.");
    }
The input file is 81,000+ lines, so on this run the streams version was definitely faster, and it is easier to read as well.
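As a possible variation on the streams version, the grouping could be done by a collector instead of a shared map that gets updated from forEach. The sketch below is not the code from the repository: the class name CityStateCollector, the method statesByCity, and the sample rows are all made up for illustration, and it assumes the city and state sit in columns 3 and 4 (0-based), as in the snippets above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class CityStateCollector {

    // Groups tab-delimited lines into city -> sorted set of states.
    // Assumption: city is column 3 and state is column 4 (0-based).
    static Map<String, Set<String>> statesByCity(Stream<String> lines) {
        return lines
                .map(line -> line.split("\\t"))
                .filter(fields -> fields.length > 4) // skip blank or short lines
                .collect(Collectors.groupingBy(
                        fields -> fields[3],
                        TreeMap::new,
                        Collectors.mapping(
                                fields -> fields[4],
                                Collectors.toCollection(TreeSet::new))));
    }

    public static void main(String[] args) {
        // Made-up rows; the real file's other columns will differ.
        List<String> sample = Arrays.asList(
                "c0\tc1\tc2\tFRANKLIN\tIN",
                "c0\tc1\tc2\tFRANKLIN\tTN",
                "c0\tc1\tc2\tFRANKLIN\tIN",
                "c0\tc1\tc2\tSALEM\tOR");

        statesByCity(sample.stream())
                .forEach((city, states) -> System.out.println(city + " -> " + states));
    }
}
```

With a real file, the stream passed in would come from Files.lines inside a try-with-resources block. Using a Set as the value takes care of duplicate city/state pairs, and a TreeSet keeps the states sorted.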
I used the same function to do the parsing and keeping track of the data, so there would be no difference in the actual processing.
    private static void addStateToCity(String city, String state) {
        // Make sure that the combination of city/state is unique,
        // as there may be multiple zip code records for a city.
        CityStates cityInStates = cityStates.get(city);
        if (cityInStates == null) {
            cityStates.put(city, new CityStates(city, state));
        } else if (!cityInStates.statesContainsState(state)) {
            // The object is already in the map, so updating it in place is enough.
            cityInStates.addState(state);
        }
    }
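For comparison, here is a minimal sketch of the same uniqueness check written with Map.computeIfAbsent and a Set. It assumes the data is held as a plain Map<String, Set<String>> rather than in the CityStates class (which isn't shown in this post); CityStateIndex and its methods are hypothetical names used only for this example.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class CityStateIndex {

    // city name -> states that have a city with that name (kept sorted)
    private final Map<String, Set<String>> statesByCity = new HashMap<>();

    // Creates the set on first sight of a city; Set.add ignores duplicates,
    // so repeated zip code records for the same city/state are harmless.
    void addStateToCity(String city, String state) {
        statesByCity.computeIfAbsent(city, key -> new TreeSet<>()).add(state);
    }

    // Returns the states recorded for a city, or an empty set if none.
    Set<String> statesFor(String city) {
        return statesByCity.getOrDefault(city, Collections.emptySet());
    }

    public static void main(String[] args) {
        CityStateIndex index = new CityStateIndex();
        index.addStateToCity("FRANKLIN", "TN");
        index.addStateToCity("FRANKLIN", "IN");
        index.addStateToCity("FRANKLIN", "TN"); // duplicate record, ignored by the set

        System.out.println("FRANKLIN is in " + index.statesFor("FRANKLIN").size()
                + " states/territories : " + index.statesFor("FRANKLIN"));
    }
}
```

The trade-off is that the counting and formatting logic that presumably lives in CityStates would have to move elsewhere.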
When I was done, I wanted to output the city that was in the most states, but then thought it might be nice to see the top 10 (or x) cities. I was going to build a comparator class to sort the objects when I saw that I could use a stream for this as well. So I implemented that in printMostUsedCityNames below.
    private static void printMostUsedCityNames(int limit, boolean printStates) {
        cityStates.entrySet()
                .stream()
                // compare b to a (reversed) so the highest state count comes first
                .sorted((a, b) -> b.getValue().getStateCount().compareTo(a.getValue().getStateCount()))
                .limit(limit)
                .forEach(e -> System.out.println(
                        printStates ? e.getValue().printCityCountWithStates() : e.getValue().printCityCount()
                ));
    }
Resulting in
    FRANKLIN is in 31 states/territories
    CLINTON is in 29 states/territories
    WASHINGTON is in 29 states/territories
    ARLINGTON is in 28 states/territories
    SALEM is in 27 states/territories
    CHESTER is in 27 states/territories
    MADISON is in 27 states/territories
    GEORGETOWN is in 27 states/territories
    MARION is in 26 states/territories
    SPRINGFIELD is in 26 states/territories
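The descending sort can also be spelled out with the Comparator helper methods that came with Java 8. This is only a sketch under a few assumptions: it works on a plain Map<String, Set<String>> instead of the CityStates objects used above, the names TopCities and topCities are invented for the example, and the tie-break on city name is an addition so that cities with equal counts come out in a predictable order.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

class TopCities {

    // Returns one line per city for the `limit` cities found in the most states.
    static List<String> topCities(Map<String, Set<String>> statesByCity, int limit) {
        // Highest state count first; ties are broken alphabetically by city name.
        Comparator<Map.Entry<String, Set<String>>> byStateCountDesc = Comparator
                .comparingInt((Map.Entry<String, Set<String>> e) -> e.getValue().size())
                .reversed()
                .thenComparing(e -> e.getKey());

        return statesByCity.entrySet().stream()
                .sorted(byStateCountDesc)
                .limit(limit)
                .map(e -> e.getKey() + " is in " + e.getValue().size() + " states/territories")
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Tiny made-up data set, not taken from the GeoNames file.
        Map<String, Set<String>> sample = new HashMap<>();
        sample.put("SALEM", new TreeSet<>(Arrays.asList("MA", "OR")));
        sample.put("FRANKLIN", new TreeSet<>(Arrays.asList("IN", "TN", "WI")));
        sample.put("MARION", new TreeSet<>(Arrays.asList("OH")));

        topCities(sample, 2).forEach(System.out::println);
    }
}
```

The explicit parameter type on the first lambda is there because the compiler cannot infer it once reversed() is chained onto the call.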
In the code you’ll see that printMostUsedCityNames is overloaded so that it can also print the states that a city is in (with the states sorted).
    FRANKLIN is in 31 states/territories : AL, AR, CT, GA, IA, ID, IL, IN, KS, KY, LA, MA, MD, ME, MI, MN, MO, NC, NE, NH, NJ, NY, OH, PA, SD, TN, TX, VA, VT, WI, WV
    CLINTON is in 29 states/territories : AL, AR, CA, CT, IA, IL, IN, KY, LA, MA, MD, ME, MI, MN, MO, MS, MT, NC, NE, NJ, NY, OH, OK, PA, SC, TN, UT, WA, WI
    WASHINGTON is in 29 states/territories : AR, CA, CO, CT, DC, GA, IA, IL, IN, KS, KY, LA, MA, ME, MI, MO, MS, NC, NE, NH, NJ, OK, PA, TX, UT, VA, VT, WI, WV
    ARLINGTON is in 28 states/territories : AL, AZ, CA, CO, GA, IA, IL, IN, KS, KY, MA, MD, MN, MS, NC, NE, NJ, NY, OH, OR, SD, TN, TX, VA, VT, WA, WI, WY
    SALEM is in 27 states/territories : AL, AR, CT, FL, IA, IL, IN, KY, MA, MD, MI, MO, MS, NE, NH, NJ, NM, NY, OH, OK, OR, SC, SD, UT, VA, WI, WV
    CHESTER is in 27 states/territories : AR, CA, CT, GA, IA, ID, IL, IN, MA, MD, ME, MS, MT, NE, NH, NJ, NY, OH, OK, PA, SC, SD, TX, UT, VA, VT, WV
    MADISON is in 27 states/territories : AL, AR, CA, CT, FL, GA, IL, IN, KS, MD, ME, MN, MO, MS, NC, NE, NH, NJ, NY, OH, PA, SC, SD, TN, VA, WI, WV
    GEORGETOWN is in 27 states/territories : AL, AR, CA, CO, CT, DE, FL, GA, IA, ID, IL, IN, KY, LA, MA, MD, ME, MN, MO, MS, MT, NY, OH, PA, SC, TN, TX
    MARION is in 26 states/territories : AL, AR, CT, IA, IL, IN, KS, KY, LA, MA, MD, MI, MS, MT, NC, ND, NY, OH, OR, PA, SC, SD, TX, UT, VA, WI
    SPRINGFIELD is in 26 states/territories : AR, CO, GA, ID, IL, KY, LA, MA, ME, MI, MN, MO, NE, NH, NJ, OH, OR, PA, SC, SD, TN, TX, VA, VT, WI, WV
So that’s my first pass at using streams in multiple ways.