-
Notifications
You must be signed in to change notification settings - Fork 178
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Error attempting to claim book from newsletter #47
Comments
Updated to the latest code from
|
Can confirm, I'll look into it. For a quickfix, create a file named "lastNewsletterUrl" containing "https://www.packtpub.com/packt/free-ebook/practical-data-analysis" in the config folder. This should stop the errors for now. |
Fixed in #48 |
nice. thanks 👍 |
Getting this error now after updating : on Version - 2.2.4
|
Yes, this is the same error. The fix should be merged tonight hopefully or you can change the line yourself. |
Sorry just noticed the comment above about created the "lastNewsletterUrl" that worked. |
Thanks @juzim , the problem above is fixed, but actually there is a bug, see the log below (I've hidden some variables/paths)
|
Actually I started the script again and it seems that the 3 book are identical and the daily ebook is ignored and I have 3 copy of the same book (newsletter). Moreover I checked on firebase and the data uploaded are mixed e.g. different filename but same author (of the newsletter). |
I think we have to reset the Packtpub instance before handling the newsletter but then this would have been broken for weeks. Could you check? Also, could you comment out spider.py:22 so we can see the contents of each handled packtpub.info? |
Yep,
and second log
|
About the first question, we probably have to reset everything before a new claim, but is this the first newsletter since they reactiveted the free ebook? The other books seems correct, I checked also the logs on heroku |
Also the field |
Can you add |
Did you by any chance delete the lastNewsletterUrl file? Because if the script tries to grab an already claimed newsletter book from the archive, it won't find it on the top position and overwrite the data for whatever book is currently on there. We have to either check if the book was already claimed (would be a nice feature anyways) or find the book in the list by name/id/etc. I'll try to look into it tomorrow but it shouldn't happen again unless you delete the file. |
I'm testing now, using docker, with the change that you suggested i.e. add lines 123
I have the following error
|
When we reset the whole packtpub we also loose the login information, so it won't work. I added a method to reset the packtpub.info but this won't solve the other issue. |
We should probably just reset in Packtpub.py
What do you think? |
No same problem, we lose the session |
I think the solution is to see if the book is already claimed before further processing it. The claim response page contains an error message that we can parse. I try to submit a patch tomorrow. |
About your solution I just don't like the fact that we have to do another request, we should be able to reset all the fields before. |
We don't need another request, the claim request returns the archive with
the error message (if the book was already claimed). So it's just another
check in get_claim that will throw an exception that prevents further
processing.
This has some small downsides but will fix your issue.
niqdev <[email protected]> schrieb am Mi., 1. Feb. 2017, 09:14:
… About your solution I just don't like the fact that we have to do another
request, we should be able to reset all the fields before.
This is just my thought, but this is were mutable state sucks (we are also
missing tests) and a purely functional approach would help us a lot. By the
way, I'm not gonna rewrite anything..haha
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#47 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AEmPPA9AkfIBUjuV_QpYgZTOPphG9nYfks5rYD7rgaJpZM4Lydlb>
.
|
While I'm not a fan of flow control by exceptions, I think it works just fine in this case. Your code is solid and can surely handle quite some more beating before a rewrite would be necessary. Fixed in #50 This fix looks for a specific error message on the claim result page (which curiously only exists for the newsletter, not the dailies) which should work for now. No additional requests are made. Assuming that the first entry in the archive is always the book we are processing at the moment might cause further trouble (packtpub might switch to alphabetic sorting for example). But the only way to see if the book was already claimed in the archive right now is searching by name which is prone to errors since we parse the title from the claim URL. This would also resolve #23 Any volunteers? :) |
How would that work if you're running with |
The script would just claim the book and you can download it later manually or run it with a "downloadAll" parameter that only syncs the archive with the local folder. Notifications etc are handled on claim, not download. |
@juzim , I'll keep monitoring until next newsletter. I'll create tag |
Just to keep track, it needs further investigation. [*] fetching url... 200 | https://www.packtpub.com/packt/free-ebook/what-you-need-know-about-angular-2
[-] <type 'exceptions.IndexError'> list index out of range | spider.py@125
Traceback (most recent call last):
File "script/spider.py", line 125, in main
packtpub.runNewsletter(currentNewsletterUrl)
File "PATH/packtpub-crawler/script/packtpub.py", line 169, in runNewsletter
self.__parseNewsletterBookInfo(soup)
File "PATH/packtpub-crawler/script/packtpub.py", line 101, in __parseNewsletterBookInfo
urlWithTitle = div_target.select('div.promo-landing-book-picture a')[0]['href']
IndexError: list index out of range |
I bet they are doing a/b tests which makes it hard to reproduce. I think
claiming still works despite the error, can you confirm?
|
No, unfortunately the claiming is not working too. |
The div |
That's it?! I'll try to fix it soon but it might take till next week,
sorry.
niqdev <[email protected]> schrieb am So., 2. Apr. 2017, 11:10:
… The div promo-landing-book-picture doesn't exists
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#47 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AEmPPB4hdhLsEjGopseM72lUW5HhNEgvks5rr2YVgaJpZM4Lydlb>
.
|
Looks like some of the divs has been renamed on the newsletter's landing page. I compared the page for an older book:
with the current one:
and came up with this hotfix: |
That would be great, thanks! |
Hi Guys, |
The link doesn't work for me, can you create a pull request please? Also, while there are tons of free books on the feed, they repeat a lot so we have to make sure the duplication check works. |
Is there a reason for the newsletter spreadsheet being empty even though this week's free ebook is still available under https://www.packtpub.com/packt/free-ebook/what-you-need-know-about-angular-2? |
@mkarpiarz before merge the PR can you please confirm that the email/notifications are still working? Thanks |
@mkarpiarz I removed it to prevent error messages until the issue is fixed |
@juzim - that's fine for now since there is an option to self-host the file. I haven't tested email notification yet, @niqdev, but I printed out all the variables in this
This is because the code from: packtpub-crawler/script/packtpub.py Line 101 in e604cc1
is extracting book titles based on the url inside the class where the book cover is and for this week's free ebook the relevant part looks like this:
So there is no link with the book title and instead the url points to the location of the cover image. I'll create a separate thread for this title parsing issue. |
It has successfully claimed the book from the newsletter already, but on subsequent days I'm getting the above error.
And it sends an IFTTT notification for the second one :(
The text was updated successfully, but these errors were encountered: